@@ -0,0 +1,168 @@
# Benchmark
Machine:

- CPU: 12-core Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
- GPU: Tesla K40m
- cuDNN: v5.1
- System: Docker 1.12.1; all platforms are tested inside Docker containers.

Platform:

- PaddlePaddle:
- TensorFlow: gcr.io/tensorflow/tensorflow:0.11.0rc0-gpu
- Caffe:

Several convolutional neural networks and a recurrent neural network are used in the tests.
## Image

### Benchmark Model

AlexNet, GoogleNet, and a small network based on the cifar10 config in Caffe are used.

- [AlexNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet): the group size is set to one.

- [GoogleNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet): loss1 and loss2 are removed when running the benchmark.

- [SmallNet](https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_quick_train_test.prototxt)

### Single-GPU
- AlexNet: input 3 * 227 * 227, time in ms/batch

| BatchSize    | 64  | 128 | 256  | 512  |
|--------------|-----|-----|------|------|
| PaddlePaddle | 195 | 334 | 602  | 1629 |
| TensorFlow   | 223 | 364 | 645  | 1235 |
| Caffe        | 324 | 627 | 1232 | 2513 |

##### Notation

All platforms use cuDNN v5.1. Caffe appears slower here because the workspace limit in Caffe's cuDNN convolution interface is 8 * 1024 * 1024 bytes, while PaddlePaddle and TensorFlow allow a larger workspace; Caffe becomes faster if this limit is increased.
- GoogleNet: input 3 * 224 * 224, time in ms/batch

| BatchSize    | 64  | 128  | 256           |
|--------------|-----|------|---------------|
| PaddlePaddle | 613 | 1149 | 2348          |
| TensorFlow   | 644 | 1176 | 2219          |
| Caffe        | 694 | 1364 | out of memory |

- SmallNet: input 3 * 32 * 32, time in ms/batch

| BatchSize    | 64     | 128     | 256     | 512    |
|--------------|--------|---------|---------|--------|
| PaddlePaddle | 10.463 | 18.184  | 33.113  | 63.039 |
| TensorFlow   | 9      | 15      | 28      | 59     |
| Caffe        | 9.373  | 16.6606 | 31.4797 | 59.719 |

##### Notation

All Caffe tests are run with `caffe time`, which does not include the parameter-update step, whereas the PaddlePaddle and TensorFlow timings do include it.

TensorFlow implements its own convolution-algorithm search instead of using the algorithm-search interface in cuDNN.
### Multi-GPU: 4 GPUs

- AlexNet, ms/batch

| total-BatchSize | 128 * 4 | 256 * 4 |
|-----------------|---------|---------|
| PaddlePaddle    | 347     | 622     |
| TensorFlow      | 377     | 675     |
| Caffe           | 1229    | 2435    |

For example, if `total-BatchSize = 128 * 4`, the speedup over one GPU is calculated as

```
time_at_1gpu_batch_128 * 4 / time_at_4gpu_total_batch_512
= (334 * 4) / 347
= 3.85
```
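
The same calculation applies to every entry in the tables above. As a quick check, a minimal sketch with the timings copied from the AlexNet single-GPU and 4-GPU tables in this document:

```
# Speedup of 4 GPUs over 1 GPU, per platform and per-GPU batch size.
# time_1gpu: single-GPU ms/batch; time_4gpu: 4-GPU ms/batch at 4x the total batch.
time_1gpu = {'PaddlePaddle': {128: 334, 256: 602},
             'TensorFlow':   {128: 364, 256: 645},
             'Caffe':        {128: 627, 256: 1232}}
time_4gpu = {'PaddlePaddle': {128: 347, 256: 622},
             'TensorFlow':   {128: 377, 256: 675},
             'Caffe':        {128: 1229, 256: 2435}}

for platform in ['PaddlePaddle', 'TensorFlow', 'Caffe']:
    for batch in [128, 256]:
        speedup = 4.0 * time_1gpu[platform][batch] / time_4gpu[platform][batch]
        print('%s, total batch %d: speedup %.2f' % (platform, 4 * batch, speedup))
```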
<img src="figs/alexnet-4gpu.png" width="420">

- GoogleNet, ms/batch

| total-BatchSize | 128 * 4 | 256 * 4       |
|-----------------|---------|---------------|
| PaddlePaddle    | 1178    | 2367          |
| TensorFlow      | 1210    | 2292          |
| Caffe           | 2007    | out of memory |

<img src="figs/googlenet-4gpu.png" width="420">
## RNN

We use an LSTM network for text classification as the RNN benchmark.

### Dataset

- [IMDB](http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl)
- Sequence length = 100. PaddlePaddle supports training with variable-length sequences, but TensorFlow requires padding, so for a fair comparison we also pad every sequence to length 100 in PaddlePaddle (see the sketch after this list).
- Dictionary size = 30000
- Peephole connections are used in `lstmemory` by default in PaddlePaddle; they are enabled in the TensorFlow configuration as well.
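
A minimal sketch of this padding step (the full helper, `pad_sequences`, lives in the RNN data provider `provider.py` included in this benchmark; the function name below is only illustrative):

```
import numpy as np

def pad_to_fixed_length(sequences, maxlen=100, value=0):
    # Post-truncate long sequences and post-pad short ones with `value`,
    # so every sample has exactly `maxlen` word ids.
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for i, seq in enumerate(sequences):
        trunc = seq[:maxlen]
        out[i, :len(trunc)] = trunc
    return out

print(pad_to_fixed_length([[4, 25, 7], [9, 13]], maxlen=5))
# [[ 4 25  7  0  0]
#  [ 9 13  0  0  0]]
```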
### Single GPU

#### LSTM in Text Classification

We test a `2 lstm layer + fc` network with different hidden sizes and batch sizes.
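
For reference, the layer stack being timed looks roughly like this (condensed from the `rnn.py` config in this benchmark, with `lstm_num = 2`):

```
from paddle.trainer_config_helpers import *

vocab_size, hidden_size, num_class = 30000, 256, 2

net = data_layer('data', size=vocab_size)
net = embedding_layer(input=net, size=128)
for i in xrange(2):                      # two stacked LSTM layers
    net = simple_lstm(input=net, size=hidden_size)
net = last_seq(input=net)                # take the last time step
net = fc_layer(input=net, size=num_class, act=SoftmaxActivation())
```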
- Batch size = 64, ms/batch

| hidden_size  | 256 | 512 | 1280 |
|--------------|-----|-----|------|
| PaddlePaddle | 83  | 184 | 641  |
| TensorFlow   | 175 | 280 | 818  |

- Batch size = 128, ms/batch

| hidden_size  | 256 | 512 | 1280 |
|--------------|-----|-----|------|
| PaddlePaddle | 110 | 261 | 1007 |
| TensorFlow   | 181 | 361 | 1237 |

- Batch size = 256, ms/batch

| hidden_size  | 256 | 512 | 1280 |
|--------------|-----|-----|------|
| PaddlePaddle | 170 | 414 | 1655 |
| TensorFlow   | 238 | 536 | 1905 |

<img src="figs/rnn_lstm_cls.png" width="600">

#### Seq2Seq

The benchmark of the sequence-to-sequence network will be added later.
### Multi-GPU: 4 GPUs

#### LSTM in Text Classification

- hidden_size = 256, ms/batch

| batch_size   | 256 | 512 |
|--------------|-----|-----|
| PaddlePaddle | 90  | 118 |
| TensorFlow   | 226 | 118 |

- hidden_size = 512, ms/batch

| batch_size   | 256 | 512 |
|--------------|-----|-----|
| PaddlePaddle | 189 | 268 |
| TensorFlow   | 297 | 383 |

<img src="figs/rnn_lstm_4gpus.png" width="420">

#### Seq2Seq

The benchmark of the sequence-to-sequence network will be added later.
@@ -0,0 +1,30 @@
#!/bin/bash
set -e

function test() {
    cfg=$1
    batch=$2
    prefix=$3
    # The line following `input: "data"` / `input: "label"` in the prototxt holds
    # the batch dimension; patch it to the requested batch size.
    sed -i "/input: \"data\"/{n;s/^input_dim.*/input_dim: $batch/g}" $cfg
    sed -i "/input: \"label\"/{n;s/^input_dim.*/input_dim: $batch/g}" $cfg
    caffe time --model=$cfg --iterations=50 --gpu 0 > logs/$prefix-1gpu-batch${batch}.log 2>&1
}

if [ ! -d "logs" ]; then
    mkdir logs
fi

# alexnet
test alexnet.prototxt 64 alexnet
test alexnet.prototxt 128 alexnet
test alexnet.prototxt 256 alexnet
test alexnet.prototxt 512 alexnet

# googlenet
test googlenet.prototxt 64 googlenet
test googlenet.prototxt 128 googlenet

# small net
test smallnet_mnist_cifar.prototxt 64 smallnet
test smallnet_mnist_cifar.prototxt 128 smallnet
test smallnet_mnist_cifar.prototxt 256 smallnet
test smallnet_mnist_cifar.prototxt 512 smallnet
@@ -0,0 +1,24 @@
#!/bin/bash
set -e

function test() {
    cfg=$1
    batch=$2
    prefix=$3
    # The total batch is split evenly across the 4 GPUs.
    batch_per_gpu=`expr ${batch} / 4`
    sed -i "/input: \"data\"/{n;s/^input_dim.*/input_dim: ${batch_per_gpu}/g}" $cfg
    sed -i "/input: \"label\"/{n;s/^input_dim.*/input_dim: ${batch_per_gpu}/g}" $cfg
    sed -i "1c\net : \"${cfg}\"" solver.prototxt
    caffe train --solver=solver.prototxt -gpu all > logs/${prefix}-4gpu-batch${batch}.log 2>&1
}

if [ ! -d "logs" ]; then
    mkdir logs
fi

# alexnet
test alexnet.prototxt 512 alexnet
test alexnet.prototxt 1024 alexnet

# googlenet
test googlenet.prototxt 512 googlenet
@@ -0,0 +1,198 @@
name: "mnist/cifar"
input: "data"
input_dim: 128
input_dim: 3
input_dim: 32
input_dim: 32
input: "label"
input_dim: 128
input_dim: 1
input_dim: 1
input_dim: 1
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.0001
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "pool1"
  top: "pool1"
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv3"
  type: "Convolution"
  bottom: "pool2"
  top: "conv3"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 64
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu3"
  type: "ReLU"
  bottom: "conv3"
  top: "conv3"
}
layer {
  name: "pool3"
  type: "Pooling"
  bottom: "conv3"
  top: "pool3"
  pooling_param {
    pool: AVE
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool3"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 64
    weight_filler {
      type: "gaussian"
      std: 0.1
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.1
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
@@ -0,0 +1,10 @@
net: "alexnet.prototxt"
base_lr: 0.01
lr_policy: "fixed"
display: 20
max_iter: 200
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "models/caffe_alexnet_train"
solver_mode: GPU
@@ -0,0 +1,57 @@
#!/usr/bin/env python

from paddle.trainer_config_helpers import *

height = 227
width = 227
num_class = 1000
batch_size = get_config_arg('batch_size', int, 128)

args = {'height': height, 'width': width, 'color': True, 'num_class': num_class}
define_py_data_sources2("train.list",
                        None,
                        module="provider",
                        obj="process",
                        args=args)

settings(
    batch_size=batch_size,
    learning_rate=0.01 / batch_size,
    learning_method=MomentumOptimizer(0.9),
    regularization=L2Regularization(0.0005 * batch_size)
)

# conv1
net = data_layer('data', size=height * width * 3)
net = img_conv_layer(input=net, filter_size=11, num_channels=3,
                     num_filters=96, stride=4, padding=1)
net = img_cmrnorm_layer(input=net, size=5, scale=0.0001, power=0.75)
net = img_pool_layer(input=net, pool_size=3, stride=2)

# conv2
net = img_conv_layer(input=net, filter_size=5, num_filters=256,
                     stride=1, padding=2, groups=1)
net = img_cmrnorm_layer(input=net, size=5, scale=0.0001, power=0.75)
net = img_pool_layer(input=net, pool_size=3, stride=2)

# conv3
net = img_conv_layer(input=net, filter_size=3, num_filters=384,
                     stride=1, padding=1)
# conv4
net = img_conv_layer(input=net, filter_size=3, num_filters=384,
                     stride=1, padding=1, groups=1)

# conv5
net = img_conv_layer(input=net, filter_size=3, num_filters=256,
                     stride=1, padding=1, groups=1)
net = img_pool_layer(input=net, pool_size=3, stride=2)

net = fc_layer(input=net, size=4096, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
net = fc_layer(input=net, size=4096, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
net = fc_layer(input=net, size=1000, act=SoftmaxActivation())

lab = data_layer('label', num_class)
loss = cross_entropy(input=net, label=lab)
outputs(loss)
@@ -0,0 +1,147 @@
#!/usr/bin/env python
from paddle.trainer_config_helpers import *

height = 224
width = 224
num_class = 1000
batch_size = get_config_arg('batch_size', int, 128)

args = {'height': height, 'width': width, 'color': True, 'num_class': num_class}
define_py_data_sources2("train.list",
                        None,
                        module="provider",
                        obj="process",
                        args=args)

settings(
    batch_size=batch_size,
    learning_rate=0.01 / batch_size,
    learning_method=MomentumOptimizer(0.9),
    regularization=L2Regularization(0.0005 * batch_size)
)


def inception2(name, input, channels,
               filter1,
               filter3R, filter3,
               filter5R, filter5,
               proj):
    conv1 = name + '_1'
    conv3r = name + '_3r'
    conv3 = name + '_3'
    conv5r = name + '_5r'
    conv5 = name + '_5'
    maxpool = name + '_max'
    convproj = name + '_proj'

    cov1 = img_conv_layer(name=conv1, input=input, filter_size=1,
                          num_channels=channels, num_filters=filter1,
                          stride=1, padding=0)

    cov3r = img_conv_layer(name=conv3r, input=input, filter_size=1,
                           num_channels=channels, num_filters=filter3R,
                           stride=1, padding=0)
    cov3 = img_conv_layer(name=conv3, input=cov3r, filter_size=3,
                          num_filters=filter3, stride=1, padding=1)

    cov5r = img_conv_layer(name=conv5r, input=input, filter_size=1,
                           num_channels=channels, num_filters=filter5R,
                           stride=1, padding=0)
    cov5 = img_conv_layer(name=conv5, input=cov5r, filter_size=5,
                          num_filters=filter5, stride=1, padding=2)

    pool1 = img_pool_layer(name=maxpool, input=input, pool_size=3,
                           num_channels=channels, stride=1, padding=1)
    covprj = img_conv_layer(name=convproj, input=pool1, filter_size=1,
                            num_filters=proj, stride=1, padding=0)

    cat = concat_layer(name=name, input=[cov1, cov3, cov5, covprj])
    return cat


def inception(name, input, channels,
              filter1,
              filter3R, filter3,
              filter5R, filter5,
              proj):
    cov1 = conv_projection(input=input, filter_size=1, num_channels=channels,
                           num_filters=filter1, stride=1, padding=0)

    cov3r = img_conv_layer(name=name + '_3r', input=input, filter_size=1,
                           num_channels=channels, num_filters=filter3R,
                           stride=1, padding=0)
    cov3 = conv_projection(input=cov3r, filter_size=3, num_filters=filter3,
                           stride=1, padding=1)

    cov5r = img_conv_layer(name=name + '_5r', input=input, filter_size=1,
                           num_channels=channels, num_filters=filter5R,
                           stride=1, padding=0)
    cov5 = conv_projection(input=cov5r, filter_size=5, num_filters=filter5,
                           stride=1, padding=2)

    pool1 = img_pool_layer(name=name + '_max', input=input, pool_size=3,
                           num_channels=channels, stride=1, padding=1)
    covprj = conv_projection(input=pool1, filter_size=1, num_filters=proj,
                             stride=1, padding=0)

    cat = concat_layer(name=name, input=[cov1, cov3, cov5, covprj],
                       bias_attr=True, act=ReluActivation())
    return cat


lab = data_layer(name="label", size=1000)
data = data_layer(name="input", size=3 * height * width)

# stage 1
conv1 = img_conv_layer(name="conv1", input=data, filter_size=7,
                       num_channels=3, num_filters=64, stride=2, padding=3)
pool1 = img_pool_layer(name="pool1", input=conv1, pool_size=3,
                       num_channels=64, stride=2)

# stage 2
conv2_1 = img_conv_layer(name="conv2_1", input=pool1, filter_size=1,
                         num_filters=64, stride=1, padding=0)
conv2_2 = img_conv_layer(name="conv2_2", input=conv2_1, filter_size=3,
                         num_filters=192, stride=1, padding=1)
pool2 = img_pool_layer(name="pool2", input=conv2_2, pool_size=3,
                       num_channels=192, stride=2)

# stage 3
ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32)
ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64)
pool3 = img_pool_layer(name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2)

# stage 4
ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64)
ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64)
ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64)
ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64)
ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128)
pool4 = img_pool_layer(name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2)

# stage 5
ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128)
ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128)
pool5 = img_pool_layer(name="pool5", input=ince5b, num_channels=1024, pool_size=7, stride=7, pool_type=AvgPooling())

# loss1 and loss2 are removed for all systems when running the benchmark.
# output 1
# pool_o1 = img_pool_layer(name="pool_o1", input=ince4a, num_channels=512, pool_size=5, stride=3, pool_type=AvgPooling())
# conv_o1 = img_conv_layer(name="conv_o1", input=pool_o1, filter_size=1, num_filters=128, stride=1, padding=0)
# fc_o1 = fc_layer(name="fc_o1", input=conv_o1, size=1024, layer_attr=ExtraAttr(drop_rate=0.7), act=ReluActivation())
# out1 = fc_layer(name="output1", input=fc_o1, size=1000, act=SoftmaxActivation())
# loss1 = cross_entropy(name='loss1', input=out1, label=lab, coeff=0.3)

# output 2
# pool_o2 = img_pool_layer(name="pool_o2", input=ince4d, num_channels=528, pool_size=5, stride=3, pool_type=AvgPooling())
# conv_o2 = img_conv_layer(name="conv_o2", input=pool_o2, filter_size=1, num_filters=128, stride=1, padding=0)
# fc_o2 = fc_layer(name="fc_o2", input=conv_o2, size=1024, layer_attr=ExtraAttr(drop_rate=0.7), act=ReluActivation())
# out2 = fc_layer(name="output2", input=fc_o2, size=1000, act=SoftmaxActivation())
# loss2 = cross_entropy(name='loss2', input=out2, label=lab, coeff=0.3)

# output 3
dropout = dropout_layer(name="dropout", input=pool5, dropout_rate=0.4)
out3 = fc_layer(name="output3", input=dropout, size=1000, act=SoftmaxActivation())
loss3 = cross_entropy(name='loss3', input=out3, label=lab)

outputs(loss3)
@@ -0,0 +1,24 @@
import io, os
import random
import numpy as np
from paddle.trainer.PyDataProvider2 import *


def initHook(settings, height, width, color, num_class, **kwargs):
    settings.height = height
    settings.width = width
    settings.color = color
    settings.num_class = num_class
    if settings.color:
        settings.data_size = settings.height * settings.width * 3
    else:
        settings.data_size = settings.height * settings.width

    settings.slots = [dense_vector(settings.data_size), integer_value(settings.num_class)]


@provider(init_hook=initHook, min_pool_size=-1, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, file_list):
    with open(file_list, 'r') as fdata:
        for line in fdata:
            # Random data is sufficient for timing; one sample per line of the list file.
            img = np.random.rand(1, settings.data_size).reshape(-1, 1).flatten()
            lab = random.randint(0, settings.num_class - 1)
            yield img.tolist(), int(lab)
@@ -0,0 +1,54 @@
#!/bin/bash
set -e

function gen_file() {
    # Generate a dummy 1024-line file list; the provider produces random data anyway.
    if [ ! -f "train.txt" ]; then
        for ((i=1;i<=1024;i++))
        do
            echo "train/n09246464/n09246464_38735.jpeg 972" >> train.txt
        done
    fi

    if [ ! -f "train.list" ]; then
        echo "train.txt" > train.list
    fi
}

function train() {
    cfg=$1
    thread=$2
    bz=$3
    args="batch_size=$3"
    prefix=$4
    paddle train --job=time \
        --config=$cfg \
        --use_gpu=True \
        --trainer_count=$thread \
        --log_period=10 \
        --test_period=100 \
        --config_args=$args \
        --cudnn_dir=/home/dangqingqing/tools/cudnn-5.1/lib64 \
        > logs/$prefix-${thread}gpu-$bz.log 2>&1
}

gen_file
if [ ! -d "logs" ]; then
    mkdir logs
fi

#========single-gpu=========#
# alexnet
train alexnet.py 1 64 alexnet
train alexnet.py 1 128 alexnet
train alexnet.py 1 256 alexnet
train alexnet.py 1 512 alexnet

# googlenet
train googlenet.py 1 64 googlenet
train googlenet.py 1 128 googlenet
train googlenet.py 1 256 googlenet

# smallnet
train smallnet_mnist_cifar.py 1 64 smallnet
train smallnet_mnist_cifar.py 1 128 smallnet
train smallnet_mnist_cifar.py 1 256 smallnet
train smallnet_mnist_cifar.py 1 512 smallnet
@@ -0,0 +1,42 @@
#!/bin/bash
set -e

function gen_file() {
    if [ ! -f "train.txt" ]; then
        for ((i=1;i<=1024;i++))
        do
            echo "train/n09246464/n09246464_38735.jpeg 972" >> train.txt
        done
    fi

    if [ ! -f "train.list" ]; then
        echo "train.txt" > train.list
    fi
}

function train() {
    cfg=$1
    thread=$2
    bz=$3
    args="batch_size=$3"
    prefix=$4
    paddle train --job=time \
        --config=$cfg \
        --use_gpu=True \
        --trainer_count=$thread \
        --log_period=10 \
        --test_period=100 \
        --config_args=$args \
        > logs/$prefix-${thread}gpu-$bz.log 2>&1
}

gen_file
if [ ! -d "logs" ]; then
    mkdir logs
fi

#========multi-gpus=========#
train alexnet.py 4 512 alexnet
train alexnet.py 4 1024 alexnet

train googlenet.py 4 512 googlenet
train googlenet.py 4 1024 googlenet
@@ -0,0 +1,47 @@
#!/usr/bin/env python

from paddle.trainer_config_helpers import *

height = 32
width = 32
num_class = 10

batch_size = get_config_arg('batch_size', int, 128)

args = {'height': height, 'width': width, 'color': True, 'num_class': num_class}
define_py_data_sources2("train.list",
                        None,
                        module="provider",
                        obj="process",
                        args=args)

settings(
    batch_size=batch_size,
    learning_rate=0.01 / batch_size,
    learning_method=MomentumOptimizer(0.9),
    regularization=L2Regularization(0.0005 * batch_size)
)


# conv1
net = data_layer('data', size=height * width * 3)
net = img_conv_layer(input=net, filter_size=5, num_channels=3,
                     num_filters=32, stride=1, padding=2)
net = img_pool_layer(input=net, pool_size=3, stride=2, padding=1)

# conv2
net = img_conv_layer(input=net, filter_size=5, num_filters=32,
                     stride=1, padding=2)
net = img_pool_layer(input=net, pool_size=3, stride=2, padding=1, pool_type=AvgPooling())

# conv3
net = img_conv_layer(input=net, filter_size=3, num_filters=64,
                     stride=1, padding=1)
net = img_pool_layer(input=net, pool_size=3, stride=2, padding=1, pool_type=AvgPooling())

net = fc_layer(input=net, size=64, act=ReluActivation())
net = fc_layer(input=net, size=10, act=SoftmaxActivation())

lab = data_layer('label', num_class)
loss = classification_cost(input=net, label=lab)
outputs(loss)
@@ -0,0 +1,42 @@
from __future__ import print_function
import six.moves.cPickle as pickle
import gzip
import os
import numpy


def get_dataset_file(dataset, default_dataset, origin):
    data_dir, data_file = os.path.split(dataset)
    if (not os.path.isfile(dataset)) and data_file == default_dataset:
        from six.moves import urllib
        print('Downloading data from %s' % origin)
        urllib.request.urlretrieve(origin, dataset)

    return dataset


def create_data(path="imdb.pkl"):

    if not os.path.isfile('imdb.train.pkl'):
        path = get_dataset_file(
            path, "imdb.pkl",
            "http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl")

        if path.endswith(".gz"):
            f = gzip.open(path, 'rb')
        else:
            f = open(path, 'rb')

        # imdb.pkl holds two pickled objects back to back: the train and test sets.
        train_set = pickle.load(f)
        test_set = pickle.load(f)
        f.close()

        pickle.dump(train_set, open('imdb.train.pkl', 'wb'))
        pickle.dump(test_set, open('imdb.test.pkl', 'wb'))

    if not os.path.isfile('train.list'):
        open('train.list', 'w').write('imdb.train.pkl\n')


def main():
    create_data('imdb.pkl')


if __name__ == "__main__":
    main()
@@ -0,0 +1,64 @@
import io, os
import random
import numpy as np
import six.moves.cPickle as pickle
from paddle.trainer.PyDataProvider2 import *


def remove_unk(x, n_words):
    return [[1 if w >= n_words else w for w in sen] for sen in x]


# ==============================================================
# TensorFlow uses fixed-length sequences, while PaddlePaddle can
# process variable-length ones. Padding is used in this benchmark
# so that the platforms are comparable.
# ==============================================================
def pad_sequences(sequences, maxlen=None, dtype='int32', padding='post',
                  truncating='post', value=0.):
    lengths = [len(s) for s in sequences]

    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    x = (np.ones((nb_samples, maxlen)) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError("Truncating type '%s' not understood" % truncating)

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError("Padding type '%s' not understood" % padding)
    return x


def initHook(settings, vocab_size, pad_seq, maxlen, **kwargs):
    settings.vocab_size = vocab_size
    settings.pad_seq = pad_seq
    settings.maxlen = maxlen
    settings.input_types = [
        integer_value_sequence(vocab_size),
        integer_value(2)]


@provider(init_hook=initHook, min_pool_size=-1, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, file):
    f = open(file, 'rb')
    train_set = pickle.load(f)
    f.close()
    x, y = train_set

    # remove unk, i.e. map words outside the dictionary to id 1
    x = remove_unk(x, settings.vocab_size)
    if settings.pad_seq:
        x = pad_sequences(x, maxlen=settings.maxlen, value=0.)

    for i in range(len(y)):
        yield map(int, x[i]), int(y[i])
@@ -0,0 +1,42 @@
#!/usr/bin/env python

from paddle.trainer_config_helpers import *
import imdb

num_class = 2
vocab_size = 30000
fixedlen = 100
batch_size = get_config_arg('batch_size', int, 128)
lstm_num = get_config_arg('lstm_num', int, 1)
hidden_size = get_config_arg('hidden_size', int, 128)
# whether to pad sequences to a fixed length
pad_seq = get_config_arg('pad_seq', bool, True)
imdb.create_data('imdb.pkl')

args = {'vocab_size': vocab_size, 'pad_seq': pad_seq, 'maxlen': fixedlen}
define_py_data_sources2("train.list",
                        None,
                        module="provider",
                        obj="process",
                        args=args)

settings(
    batch_size=batch_size,
    learning_rate=2e-3,
    learning_method=AdamOptimizer(),
    regularization=L2Regularization(8e-4),
    gradient_clipping_threshold=25
)

net = data_layer('data', size=vocab_size)
net = embedding_layer(input=net, size=128)

for i in xrange(lstm_num):
    net = simple_lstm(input=net, size=hidden_size)

net = last_seq(input=net)
net = fc_layer(input=net, size=2, act=SoftmaxActivation())

lab = data_layer('label', num_class)
loss = classification_cost(input=net, label=lab)
outputs(loss)
@@ -0,0 +1,38 @@
#!/bin/bash
set -e

function train() {
    cfg=$1
    thread=$2
    args="lstm_num=${3},pad_seq=${4},hidden_size=${5},batch_size=${6}"
    paddle train --job=time \
        --config=$cfg \
        --use_gpu=1 \
        --trainer_count=$thread \
        --log_period=10 \
        --test_period=100 \
        --num_passes=1 \
        --feed_data=1 \
        --config_args=$args \
        >logs/rnn-pad${4}-${thread}gpu-lstm${3}-batch${6}-hid${5}.log 2>&1
}

if [ ! -d "logs" ]; then
    mkdir logs
fi

## padding, single gpu
#-----config--gpu--lstm_num--padding--hidden_size--batch_size
## lstm_num=2, batch_size=64
train rnn.py 1 2 1 256 64
train rnn.py 1 2 1 512 64
train rnn.py 1 2 1 1280 64

## lstm_num=2, batch_size=128
train rnn.py 1 2 1 256 128
train rnn.py 1 2 1 512 128
train rnn.py 1 2 1 1280 128

## lstm_num=2, batch_size=256
train rnn.py 1 2 1 256 256
train rnn.py 1 2 1 512 256
train rnn.py 1 2 1 1280 256
@@ -0,0 +1,34 @@
#!/bin/bash
set -e

function train() {
    cfg=$1
    thread=$2
    args="lstm_num=${3},pad_seq=${4},hidden_size=${5},batch_size=${6}"
    paddle train --job=time \
        --config=$cfg \
        --use_gpu=1 \
        --trainer_count=$thread \
        --log_period=10 \
        --test_period=100 \
        --num_passes=1 \
        --feed_data=1 \
        --config_args=$args \
        >logs/rnn-pad${4}-${thread}gpu-lstm${3}-hid${5}-batch${6}.log 2>&1
}


if [ ! -d "logs" ]; then
    mkdir logs
fi

#-----config--gpu--lstm_num--padding--hidden_size--batch_size
#==================multi gpus=====================#
# hidden_size=256, lstm_num=2, different batch sizes
train rnn.py 4 2 1 256 128
train rnn.py 4 2 1 256 256
train rnn.py 4 2 1 256 512

# hidden_size=512, lstm_num=2, different batch sizes
train rnn.py 4 2 1 512 128
train rnn.py 4 2 1 512 256
train rnn.py 4 2 1 512 512