# Cluster Training Benchmark
## Setup

- Platform
  - Kubernetes: v1.6.2
  - Linux Kernel: v3.10.0
- Resource
  - CPU: 10 cores per Pod
  - Memory: 5 GB per Pod
- Docker Image

  We use a different base Docker image for each framework when running the benchmark on Kubernetes:
  - PaddlePaddle v2: `paddlepaddle/paddle:0.11.0`
  - PaddlePaddle Fluid: `paddlepaddle/paddle:[commit-id]`
  - TensorFlow: `tensorflow/tensorflow:1.5.0-rc0`
- Model

  The `vgg16` model is used in this benchmark (a minimal definition sketch follows).
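For reference, a VGG16 graph can be built in a few lines. The following is a minimal sketch using `tf.keras.applications.VGG16`, which is an assumption for illustration only and not the exact model configuration used by this benchmark:

```python
# Minimal sketch only -- assumes tf.keras is available; the benchmark's
# actual model definitions (PaddlePaddle v2/Fluid, TensorFlow) may differ.
import tensorflow as tf

# Randomly initialized VGG16 for throughput benchmarking (no pretrained weights).
model = tf.keras.applications.VGG16(
    weights=None,               # train from scratch; weights don't affect speed tests
    input_shape=(224, 224, 3),  # standard VGG16 input size
    classes=1000,               # ImageNet-sized output layer
)
model.compile(optimizer="sgd", loss="categorical_crossentropy")
```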
## Cases

- Variable
  - Batch size of the training data.
  - PServer count of the training job.
  - Trainer count of the training job.
- Invariant
  - The resources (CPU and memory) of each trainer/pserver Pod.
### Measure the Performance for Different Batch Sizes

- PServer Count: 40
- Trainer Count: 100
- Metrics: mini-batches / sec (see the measurement sketch after the table)

| Batch Size | 32 | 64 | 128 | 256 |
|---|---|---|---|---|
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
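As a rough illustration of how the mini-batches/sec metric could be collected, the helper below times a fixed number of training steps. Here `train_step` is a hypothetical zero-argument callable standing in for one synchronous mini-batch update; the real benchmark's measurement code is framework-specific:

```python
import time

def measure_throughput(train_step, num_batches=100, warmup=10):
    """Return throughput in mini-batches/sec for a training callable.

    train_step: hypothetical function that runs one mini-batch update
    (framework-specific in the real benchmark).
    """
    for _ in range(warmup):        # exclude startup/compilation cost
        train_step()
    start = time.time()
    for _ in range(num_batches):
        train_step()
    return num_batches / (time.time() - start)
```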
### Measure the Performance for Different PServer Counts

- Trainer Count: 100
- Batch Size: 64
- Metrics: mini-batches / sec

| PServer Count | 10 | 20 | 40 | 60 |
|---|---|---|---|---|
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
### Measure Parallel Efficiency by Increasing the Trainer Count

- PServer Count: 20
- Batch Size: 64
- Metrics: speedup and parallel efficiency. The speedup with $N$ trainers is

  $$S = \frac{T_1}{T_N}$$

  where $T_1$ and $T_N$ are the training times with 1 and with $N$ trainers, respectively. The parallel efficiency is then

  $$E = \frac{S}{N}$$

  (a worked calculation follows the table).
| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |
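A small sketch of how the speedup and efficiency formulas above translate to code; the timing numbers below are hypothetical and purely for illustration:

```python
def speedup_and_efficiency(t1, tn, n):
    """S = T1 / TN and E = S / N, per the formulas above."""
    s = t1 / tn
    return s, s / n

# Hypothetical example: 1 trainer takes 1000 s/pass, 50 trainers take 25 s/pass.
s, e = speedup_and_efficiency(t1=1000.0, tn=25.0, n=50)
print(s, e)  # 40.0 0.8  -> 80% parallel efficiency
```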
## Reproduce the Benchmark

TODO