Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into prepare_pserver_executor
commit 1f6e0448bc
@@ -0,0 +1,205 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import argparse
import cProfile
import time

import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

SEED = 1
DTYPE = "float32"

# The random seed must be set before the network is configured.
# fluid.default_startup_program().random_seed = SEED


def parse_args():
    parser = argparse.ArgumentParser("mnist model benchmark.")
    parser.add_argument(
        '--batch_size', type=int, default=128, help='The minibatch size.')
    parser.add_argument(
        '--iterations', type=int, default=35, help='The number of minibatches.')
    parser.add_argument(
        '--pass_num', type=int, default=5, help='The number of passes.')
    parser.add_argument(
        '--device',
        type=str,
        default='GPU',
        choices=['CPU', 'GPU'],
        help='The device type.')
    parser.add_argument(
        '--infer_only', action='store_true', help='If set, run forward only.')
    parser.add_argument(
        '--use_cprof', action='store_true', help='If set, use cProfile.')
    parser.add_argument(
        '--use_nvprof',
        action='store_true',
        help='If set, use nvprof for CUDA.')
    args = parser.parse_args()
    return args


def print_arguments(args):
    vars(args)['use_nvprof'] = (vars(args)['use_nvprof'] and
                                vars(args)['device'] == 'GPU')
    print('----------- Configuration Arguments -----------')
    for arg, value in sorted(vars(args).iteritems()):
        print('%s: %s' % (arg, value))
    print('------------------------------------------------')


def cnn_model(data):
    conv_pool_1 = fluid.nets.simple_img_conv_pool(
        input=data,
        filter_size=5,
        num_filters=20,
        pool_size=2,
        pool_stride=2,
        act="relu")
    conv_pool_2 = fluid.nets.simple_img_conv_pool(
        input=conv_pool_1,
        filter_size=5,
        num_filters=50,
        pool_size=2,
        pool_stride=2,
        act="relu")

    # TODO(dzhwinter): refine the initializer and random seed setting
    SIZE = 10
    input_shape = conv_pool_2.shape
    param_shape = [reduce(lambda a, b: a * b, input_shape[1:], 1)] + [SIZE]
    scale = (2.0 / (param_shape[0]**2 * SIZE))**0.5

    predict = fluid.layers.fc(
        input=conv_pool_2,
        size=SIZE,
        act="softmax",
        param_attr=fluid.param_attr.ParamAttr(
            initializer=fluid.initializer.NormalInitializer(
                loc=0.0, scale=scale)))
    return predict


def eval_test(exe, batch_acc, batch_size_tensor, inference_program):
    # Relies on the module-level `args` parsed under __main__.
    test_reader = paddle.batch(
        paddle.dataset.mnist.test(), batch_size=args.batch_size)
    test_pass_acc = fluid.average.WeightedAverage()
    for batch_id, data in enumerate(test_reader()):
        img_data = np.array(map(lambda x: x[0].reshape([1, 28, 28]),
                                data)).astype(DTYPE)
        y_data = np.array(map(lambda x: x[1], data)).astype("int64")
        y_data = y_data.reshape([len(y_data), 1])

        acc, weight = exe.run(inference_program,
                              feed={"pixel": img_data,
                                    "label": y_data},
                              fetch_list=[batch_acc, batch_size_tensor])
        test_pass_acc.add(value=acc, weight=weight)
    pass_acc = test_pass_acc.eval()
    return pass_acc


def run_benchmark(model, args):
    if args.use_cprof:
        pr = cProfile.Profile()
        pr.enable()
    start_time = time.time()
    # Input data
    images = fluid.layers.data(name='pixel', shape=[1, 28, 28], dtype=DTYPE)
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    # Train program
    predict = model(images)
    cost = fluid.layers.cross_entropy(input=predict, label=label)
    avg_cost = fluid.layers.mean(x=cost)

    # Evaluator
    batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
    batch_acc = fluid.layers.accuracy(
        input=predict, label=label, total=batch_size_tensor)

    # Inference program
    inference_program = fluid.default_main_program().clone()
    with fluid.program_guard(inference_program):
        inference_program = fluid.io.get_inference_program(
            target_vars=[batch_acc, batch_size_tensor])

    # Optimization
    opt = fluid.optimizer.AdamOptimizer(
        learning_rate=0.001, beta1=0.9, beta2=0.999)
    opt.minimize(avg_cost)

    fluid.memory_optimize(fluid.default_main_program())

    # Initialize executor
    place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
    exe = fluid.Executor(place)

    # Parameter initialization
    exe.run(fluid.default_startup_program())

    # Reader
    train_reader = paddle.batch(
        paddle.dataset.mnist.train(), batch_size=args.batch_size)

    accuracy = fluid.average.WeightedAverage()
    for pass_id in range(args.pass_num):
        accuracy.reset()
        pass_start = time.time()
        for batch_id, data in enumerate(train_reader()):
            img_data = np.array(
                map(lambda x: x[0].reshape([1, 28, 28]), data)).astype(DTYPE)
            y_data = np.array(map(lambda x: x[1], data)).astype("int64")
            y_data = y_data.reshape([len(y_data), 1])

            start = time.time()
            outs = exe.run(
                fluid.default_main_program(),
                feed={"pixel": img_data,
                      "label": y_data},
                fetch_list=[avg_cost, batch_acc, batch_size_tensor]
            )  # The accuracy is the accumulation of batches, not that of the current batch.
            accuracy.add(value=outs[1], weight=outs[2])
            end = time.time()
            loss = np.array(outs[0])
            acc = np.array(outs[1])
            # Elapsed times are reported in seconds.
            print("pass=%d, batch=%d, loss=%f, error=%f, elapse=%f" %
                  (pass_id, batch_id, loss, 1 - acc, end - start))

        pass_end = time.time()

        train_avg_acc = accuracy.eval()
        test_avg_acc = eval_test(exe, batch_acc, batch_size_tensor,
                                 inference_program)

        print("pass=%d, train_avg_acc=%f, test_avg_acc=%f, elapse=%f" %
              (pass_id, train_avg_acc, test_avg_acc,
               pass_end - pass_start))


if __name__ == '__main__':
    args = parse_args()
    print_arguments(args)
    if args.use_nvprof and args.device == 'GPU':
        with profiler.cuda_profiler("cuda_profiler.txt", 'csv') as nvprof:
            run_benchmark(cnn_model, args)
    else:
        run_benchmark(cnn_model, args)
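
A minimal way to run the benchmark above, assuming the file is saved as fluid/mnist.py alongside the other benchmark scripts (the path is an assumption based on the run script below; the flags are the ones defined in parse_args):

    python fluid/mnist.py --device=GPU --batch_size=128 --pass_num=5
    python fluid/mnist.py --device=GPU --use_nvprof   # additionally capture a CUDA profiler trace
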
@@ -0,0 +1,49 @@
#!/bin/bash
# This script benchmarks PaddlePaddle Fluid on a single thread and a single GPU.
export CUDNN_PATH=/paddle/cudnn_v5/cuda/lib

# Disable OpenMP and MKL parallelism,
# see https://github.com/PaddlePaddle/Paddle/issues/7199
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
ht=`lscpu | grep "per core" | awk -F':' '{print $2}' | xargs`
if [ $ht -eq 1 ]; then # HT is OFF
    if [ -z "$KMP_AFFINITY" ]; then
        export KMP_AFFINITY="granularity=fine,compact,0,0"
    fi
    if [ -z "$OMP_DYNAMIC" ]; then
        export OMP_DYNAMIC="FALSE"
    fi
else # HT is ON
    if [ -z "$KMP_AFFINITY" ]; then
        export KMP_AFFINITY="granularity=fine,compact,1,0"
    fi
fi
# Use only the first GPU, even if more than one is available.
export CUDA_VISIBLE_DEVICES=0
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_PATH:$LD_LIBRARY_PATH


# vgg16
# vgg16 gpu cifar10 128
FLAGS_benchmark=true python fluid/vgg.py \
    --device=GPU \
    --batch_size=128 \
    --skip_batch_num=5 \
    --iterations=30 \
    > vgg16_gpu_128.log 2>&1

# resnet50
# resnet50 gpu cifar10 128
FLAGS_benchmark=true python fluid/resnet.py \
    --device=GPU \
    --batch_size=128 \
    --data_set=cifar10 \
    --model=resnet_cifar10 \
    --skip_batch_num=5 \
    --iterations=30 \
    > resnet50_gpu_128.log 2>&1

# lstm
@@ -0,0 +1,209 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import cPickle
import os
import random
import time

import numpy
import paddle.v2 as paddle
import paddle.v2.dataset.imdb as imdb
import paddle.fluid as fluid
from paddle.v2 import batch
import paddle.fluid.profiler as profiler


def parse_args():
    parser = argparse.ArgumentParser("Understand Sentiment by Dynamic RNN.")
    parser.add_argument(
        '--batch_size',
        type=int,
        default=32,
        help='The sequence number of a batch data. (default: %(default)d)')
    parser.add_argument(
        '--emb_dim',
        type=int,
        default=512,
        help='Dimension of embedding table. (default: %(default)d)')
    parser.add_argument(
        '--hidden_dim',
        type=int,
        default=512,
        help='Hidden size of lstm unit. (default: %(default)d)')
    parser.add_argument(
        '--pass_num',
        type=int,
        default=100,
        help='Epoch number to train. (default: %(default)d)')
    parser.add_argument(
        '--device',
        type=str,
        default='CPU',
        choices=['CPU', 'GPU'],
        help='The device type.')
    parser.add_argument(
        '--crop_size',
        type=int,
        default=int(os.environ.get('CROP_SIZE', '1500')),
        help='The maximum sentence length of the input. Since this model uses'
        ' a plain RNN, gradients may explode if the sentence is too long.')
    args = parser.parse_args()
    return args


word_dict = imdb.word_dict()


def crop_sentence(reader, crop_size):
    unk_value = word_dict['<unk>']

    def __impl__():
        for item in reader():
            if len([x for x in item[0] if x != unk_value]) < crop_size:
                yield item

    return __impl__


def main():
    args = parse_args()
    lstm_size = args.hidden_dim

    data = fluid.layers.data(
        name="words", shape=[1], lod_level=1, dtype='int64')
    sentence = fluid.layers.embedding(
        input=data, size=[len(word_dict), args.emb_dim])

    sentence = fluid.layers.fc(input=sentence, size=lstm_size, act='tanh')

    rnn = fluid.layers.DynamicRNN()
    with rnn.block():
        word = rnn.step_input(sentence)
        prev_hidden = rnn.memory(value=0.0, shape=[lstm_size])
        prev_cell = rnn.memory(value=0.0, shape=[lstm_size])

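        # The code below hand-writes a single LSTM step. For the current word
        # x_t and the previous hidden state h_{t-1}, gate_common computes
        # W*x_t + U*h_{t-1} + b; the gates then follow the standard LSTM cell:
        #   f_t = sigmoid(.)  i_t = sigmoid(.)  o_t = sigmoid(.)  g_t = tanh(.)
        #   c_t = f_t * c_{t-1} + i_t * g_t
        #   h_t = o_t * tanh(c_t)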
        def gate_common(ipt, hidden, size):
            gate0 = fluid.layers.fc(input=ipt, size=size, bias_attr=True)
            gate1 = fluid.layers.fc(input=hidden, size=size, bias_attr=False)
            gate = fluid.layers.sums(input=[gate0, gate1])
            return gate

        forget_gate = fluid.layers.sigmoid(
            x=gate_common(word, prev_hidden, lstm_size))
        input_gate = fluid.layers.sigmoid(
            x=gate_common(word, prev_hidden, lstm_size))
        output_gate = fluid.layers.sigmoid(
            x=gate_common(word, prev_hidden, lstm_size))
        cell_gate = fluid.layers.tanh(
            x=gate_common(word, prev_hidden, lstm_size))

        cell = fluid.layers.sums(input=[
            fluid.layers.elementwise_mul(
                x=forget_gate, y=prev_cell), fluid.layers.elementwise_mul(
                    x=input_gate, y=cell_gate)
        ])

        hidden = fluid.layers.elementwise_mul(
            x=output_gate, y=fluid.layers.tanh(x=cell))

        rnn.update_memory(prev_cell, cell)
        rnn.update_memory(prev_hidden, hidden)
        rnn.output(hidden)

    last = fluid.layers.sequence_pool(rnn(), 'last')
    logit = fluid.layers.fc(input=last, size=2, act='softmax')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    loss = fluid.layers.cross_entropy(input=logit, label=label)
    loss = fluid.layers.mean(x=loss)

    # Accuracy metric
    batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
    batch_acc = fluid.layers.accuracy(
        input=logit, label=label, total=batch_size_tensor)

    inference_program = fluid.default_main_program().clone()
    with fluid.program_guard(inference_program):
        inference_program = fluid.io.get_inference_program(
            target_vars=[batch_acc, batch_size_tensor])

    adam = fluid.optimizer.Adam()
    adam.minimize(loss)

    fluid.memory_optimize(fluid.default_main_program())

    place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    def train_loop(pass_num, crop_size):
        with profiler.profiler(args.device, 'total') as prof:
            for pass_id in range(pass_num):
                train_reader = batch(
                    paddle.reader.shuffle(
                        crop_sentence(imdb.train(word_dict), crop_size),
                        buf_size=25000),
                    batch_size=args.batch_size)
                word_nums = 0
                pass_start_time = time.time()
                for batch_id, data in enumerate(train_reader()):
                    tensor_words = to_lodtensor([x[0] for x in data], place)
                    for x in data:
                        word_nums += len(x[0])
                    label = numpy.array([x[1] for x in data]).astype("int64")
                    label = label.reshape((-1, 1))
                    loss_np, acc, weight = exe.run(
                        fluid.default_main_program(),
                        feed={"words": tensor_words,
                              "label": label},
                        fetch_list=[loss, batch_acc, batch_size_tensor])
                    print("pass_id=%d, batch_id=%d, loss=%f, acc=%f" %
                          (pass_id, batch_id, loss_np, acc))

                pass_end_time = time.time()
                time_consumed = pass_end_time - pass_start_time
                words_per_sec = word_nums / time_consumed
                print("pass_id=%d, sec/pass: %f, words/s: %f" %
                      (pass_id, time_consumed, words_per_sec))

    train_loop(args.pass_num, args.crop_size)


def to_lodtensor(data, place):
    seq_lens = [len(seq) for seq in data]
    cur_len = 0
    lod = [cur_len]
    for l in seq_lens:
        cur_len += l
        lod.append(cur_len)
    flattened_data = numpy.concatenate(data, axis=0).astype("int64")
    flattened_data = flattened_data.reshape([len(flattened_data), 1])
    res = fluid.LoDTensor()
    res.set(flattened_data, place)
    res.set_lod([lod])
    return res


if __name__ == '__main__':
    main()
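
A toy illustration of the LoD layout that to_lodtensor above builds (the sequence values are made up for the example):

    seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    # seq_lens = [3, 2, 4], so lod = [0, 3, 5, 9];
    # flattened_data stacks the 9 tokens into shape [9, 1], and
    # res.set_lod([[0, 3, 5, 9]]) records the sequence boundaries.
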
@@ -0,0 +1,220 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VGG16 benchmark in Fluid"""
from __future__ import print_function

import sys
import time
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.core as core
import argparse
import functools

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
    '--batch_size', type=int, default=128, help="Batch size for training.")
parser.add_argument(
    '--skip_batch_num',
    type=int,
    default=5,
    help='The number of minibatches to skip at the start, for a better performance test.')
parser.add_argument(
    '--iterations', type=int, default=80, help='The number of minibatches.')
parser.add_argument(
    '--learning_rate',
    type=float,
    default=1e-3,
    help="Learning rate for training.")
parser.add_argument('--pass_num', type=int, default=50, help="No. of passes.")
parser.add_argument(
    '--device',
    type=str,
    default='GPU',
    choices=['CPU', 'GPU'],
    help="The device type.")
parser.add_argument(
    '--data_format',
    type=str,
    default='NCHW',
    choices=['NCHW', 'NHWC'],
    help='The data order; currently only NCHW is supported.')
parser.add_argument(
    '--data_set',
    type=str,
    default='cifar10',
    choices=['cifar10', 'flowers'],
    help='Optional dataset for benchmark.')
parser.add_argument(
    '--with_test',
    action='store_true',
    help='If set, test the testset during training.')
args = parser.parse_args()


def vgg16_bn_drop(input):
    def conv_block(input, num_filter, groups, dropouts):
        return fluid.nets.img_conv_group(
            input=input,
            pool_size=2,
            pool_stride=2,
            conv_num_filter=[num_filter] * groups,
            conv_filter_size=3,
            conv_act='relu',
            conv_with_batchnorm=True,
            conv_batchnorm_drop_rate=dropouts,
            pool_type='max')

    conv1 = conv_block(input, 64, 2, [0.3, 0])
    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])

    drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
    fc1 = fluid.layers.fc(input=drop, size=512, act=None)
    bn = fluid.layers.batch_norm(input=fc1, act='relu')
    drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
    fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
    return fc2
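
# Note on the network above: the five conv_block calls stack 2+2+3+3+3 = 13
# convolutional layers (each with batch norm and ReLU); together with the two
# 512-wide fully connected layers here and the final classifier layer added in
# main() below, these are the 16 weight layers of the VGG-16 architecture.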


def main():
    if args.data_set == "cifar10":
        classdim = 10
        if args.data_format == 'NCHW':
            data_shape = [3, 32, 32]
        else:
            data_shape = [32, 32, 3]
    else:
        classdim = 102
        if args.data_format == 'NCHW':
            data_shape = [3, 224, 224]
        else:
            data_shape = [224, 224, 3]

    # Input data
    images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    # Train program
    net = vgg16_bn_drop(images)
    predict = fluid.layers.fc(input=net, size=classdim, act='softmax')
    cost = fluid.layers.cross_entropy(input=predict, label=label)
    avg_cost = fluid.layers.mean(x=cost)

    # Evaluator
    batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
    batch_acc = fluid.layers.accuracy(
        input=predict, label=label, total=batch_size_tensor)

    # Inference program
    inference_program = fluid.default_main_program().clone()
    with fluid.program_guard(inference_program):
        inference_program = fluid.io.get_inference_program(
            target_vars=[batch_acc, batch_size_tensor])

    # Optimization
    optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
    opts = optimizer.minimize(avg_cost)

    fluid.memory_optimize(fluid.default_main_program())

    # Initialize executor
    place = core.CPUPlace() if args.device == 'CPU' else core.CUDAPlace(0)
    exe = fluid.Executor(place)

    # Parameter initialization
    exe.run(fluid.default_startup_program())

    # Data readers
    train_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.cifar.train10()
            if args.data_set == 'cifar10' else paddle.dataset.flowers.train(),
            buf_size=5120),
        batch_size=args.batch_size)
    test_reader = paddle.batch(
        paddle.dataset.cifar.test10()
        if args.data_set == 'cifar10' else paddle.dataset.flowers.test(),
        batch_size=args.batch_size)

    # Test
    def test(exe):
        test_accuracy = fluid.average.WeightedAverage()
        for batch_id, data in enumerate(test_reader()):
            img_data = np.array(map(lambda x: x[0].reshape(data_shape),
                                    data)).astype("float32")
            y_data = np.array(map(lambda x: x[1], data)).astype("int64")
            y_data = y_data.reshape([-1, 1])

            acc, weight = exe.run(inference_program,
                                  feed={"pixel": img_data,
                                        "label": y_data},
                                  fetch_list=[batch_acc, batch_size_tensor])
            test_accuracy.add(value=acc, weight=weight)
        return test_accuracy.eval()

    iters, num_samples, start_time = 0, 0, time.time()
    accuracy = fluid.average.WeightedAverage()
    for pass_id in range(args.pass_num):
        accuracy.reset()
        train_accs = []
        train_losses = []
        for batch_id, data in enumerate(train_reader()):
            if iters == args.skip_batch_num:
                start_time = time.time()
                num_samples = 0
            if iters == args.iterations:
                break
            img_data = np.array(map(lambda x: x[0].reshape(data_shape),
                                    data)).astype("float32")
            y_data = np.array(map(lambda x: x[1], data)).astype("int64")
            y_data = y_data.reshape([-1, 1])

            loss, acc, weight = exe.run(
                fluid.default_main_program(),
                feed={"pixel": img_data,
                      "label": y_data},
                fetch_list=[avg_cost, batch_acc, batch_size_tensor])
            accuracy.add(value=acc, weight=weight)
            iters += 1
            num_samples += len(data)
            print(
                "Pass = %d, Iter = %d, Loss = %f, Accuracy = %f" %
                (pass_id, iters, loss, acc)
            )  # The accuracy is the accumulation of batches, not that of the current batch.

        pass_train_acc = accuracy.eval()
        train_losses.append(loss)
        train_accs.append(acc)
        # Evaluation
        if args.with_test:
            pass_test_acc = test(exe)
        train_elapsed = time.time() - start_time
        print("Pass: %d, Loss: %f, Train Accuracy: %f\n" %
              (pass_id, np.mean(train_losses), np.mean(train_accs)))


def print_arguments():
    print('----------- Configuration Arguments -----------')
    for arg, value in sorted(vars(args).iteritems()):
        print('%s: %s' % (arg, value))
    print('------------------------------------------------')


if __name__ == "__main__":
    print_arguments()
    main()
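
The accuracy bookkeeping in the benchmarks above uses fluid.average.WeightedAverage with per-batch accuracies weighted by batch size. A minimal sketch of that bookkeeping in plain Python, assuming the class maintains a weight-weighted running mean:

    class WeightedAverageSketch(object):
        def __init__(self):
            self.reset()

        def reset(self):
            self.total, self.weight_sum = 0.0, 0.0

        def add(self, value, weight):
            # accumulate value weighted by the batch size
            self.total += float(value) * float(weight)
            self.weight_sum += float(weight)

        def eval(self):
            return self.total / self.weight_sum

Weighting by batch size makes the reported pass accuracy match the accuracy over all examples seen so far, even when the last batch is smaller.
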
Some files were not shown because too many files have changed in this diff.