Cluster Training Benchmark

Setup

  • Platform

    • Kubernetes: v1.6.2
    • Linux Kernel: v3.10.0
  • Resource

    • CPU: 10 Cores per Pod
    • Memory: 5GB per Pod
  • Docker Image

    We use a different base Docker image for each framework to run the benchmark on Kubernetes:

    • PaddlePaddle v2: paddlepaddle/paddle:0.11.0
    • PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
    • TensorFlow: tensorflow/tensorflow:1.5.0-rc0
  • Model: vgg16 is used in this benchmark.

Cases

  • Variable

    • Batch Size of training data.
    • PServer count of the training job.
    • The number of trainers.
  • Invariant

    • The resources allocated to each trainer/pserver Pod.
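The three sweeps measured in the sections that follow can be written down as a small case grid. This is only a sketch for keeping the variables straight: the `Case` class and the `benchmark_cases` function are hypothetical names, not part of the benchmark scripts, and the values are copied from the settings and table headers in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One benchmark configuration (hypothetical helper type)."""
    batch_size: int
    pserver_count: int
    trainer_count: int


def benchmark_cases():
    """Enumerate the three sweeps listed in this README."""
    # Sweep 1: vary the batch size with 40 pservers and 100 trainers.
    batch_sweep = [Case(b, 40, 100) for b in (32, 64, 128, 256)]
    # Sweep 2: vary the pserver count with batch size 64 and 100 trainers.
    pserver_sweep = [Case(64, p, 100) for p in (10, 20, 40, 60)]
    # Sweep 3: vary the trainer count with batch size 64 and 20 pservers.
    trainer_counts = [1] + list(range(10, 101, 10))
    trainer_sweep = [Case(64, 20, t) for t in trainer_counts]
    return {
        "batch_size": batch_sweep,
        "pserver_count": pserver_sweep,
        "trainer_count": trainer_sweep,
    }


if __name__ == "__main__":
    for sweep, cases in benchmark_cases().items():
        print(sweep, len(cases))
```

Each case would be run once per framework (PaddlePaddle Fluid, PaddlePaddle v2, TensorFlow) with the same Pod resources.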

Measure the Performance for Different Batch Size

  • PServer Count: 40
  • Trainer Count: 100
  • Metrics: mini-batch / sec

| Batch Size | 32 | 64 | 128 | 256 |
| --- | --- | --- | --- | --- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
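The mini-batch / sec metric is simply the number of mini-batches processed divided by the elapsed wall-clock time. This README does not say how each framework's training script reports it, so the following is a minimal sketch under that assumption; `mini_batches_per_sec` and `step_fn` are hypothetical names, and warm-up iterations are not handled.

```python
import time


def mini_batches_per_sec(step_fn, num_batches, clock=time.perf_counter):
    """Run step_fn num_batches times and return the throughput.

    step_fn is a hypothetical callable that trains on one mini-batch.
    clock is injectable so the function can be tested deterministically.
    """
    if num_batches <= 0:
        raise ValueError("num_batches must be positive")
    start = clock()
    for _ in range(num_batches):
        step_fn()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return num_batches / elapsed


if __name__ == "__main__":
    # Stand-in for a real training step: sleep for one millisecond.
    rate = mini_batches_per_sec(lambda: time.sleep(0.001), 100)
    print(f"{rate:.1f} mini-batch / sec")
```

In a real run the first few iterations are often excluded, since they can include one-off start-up costs.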

Measure the Performance for Different PServer Count

  • Trainer Count: 100
  • Batch Size: 64
  • Metrics: mini-batch / sec

| PServer Count | 10 | 20 | 40 | 60 |
| --- | --- | --- | --- | --- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

Measure Parallel Efficiency By Increasing Trainer Count

  • PServer Count: 20
  • Batch Size: 64
  • Metrics:

S = \frac{T_1}{T_N}

where S is the speedup: the ratio of T_1 to T_N, the training times with 1 trainer and with N trainers respectively. The parallel efficiency is:

E = \frac{S}{N}

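| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |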
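The speedup and parallel efficiency defined in this section can be computed directly from two measured training times. The sketch below uses hypothetical function names, and the numbers in the usage example are made up for illustration only; they are not benchmark results.

```python
def speedup(t1, tn):
    """Return S = T1 / TN for training times measured in the same unit."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("training times must be positive")
    return t1 / tn


def parallel_efficiency(t1, tn, n):
    """Return E = S / N, where N is the number of trainers."""
    if n < 1:
        raise ValueError("trainer count must be at least 1")
    return speedup(t1, tn) / n


if __name__ == "__main__":
    # Illustrative values only: 1000 s with 1 trainer, 125 s with 10 trainers.
    s = speedup(1000.0, 125.0)
    e = parallel_efficiency(1000.0, 125.0, 10)
    print(f"S = {s}, E = {e}")
```

An efficiency of 1.0 would mean perfectly linear scaling. Values below 1.0 show how much of the added trainer capacity is lost to overhead such as communication with the pservers.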

Reproduce the benchmark

TODO