# Performance for Distributed vgg16

## Test Result

### Hardware Information

- CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
- CPU MHz: 2101.000
- Cache size: 20480 KB

### BLAS Settings

Set the environment variable `MKL_NUM_THREADS=1`.
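For example (a minimal sketch, assuming the benchmark process is launched from the same shell):

```bash
# Pin MKL BLAS to a single thread for the single-thread benchmark runs.
export MKL_NUM_THREADS=1
```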

### Single Node Single Thread

- Metrics: samples / sec

| Batch Size         | 32    | 64    | 128   | 256   |
| ------------------ | ----- | ----- | ----- | ----- |
| PaddlePaddle Fluid | 15.44 | 16.32 | 16.74 | 16.79 |
| PaddlePaddle v2    | 15.97 | 17.04 | 17.60 | 17.83 |
| TensorFlow         | 9.09  | 9.10  | 9.24  | 8.66  |

### Different Batch Size

- PServer Count: 10
- Trainer Count: 20
- Metrics: samples / sec

| Batch Size         | 32     | 64     | 128    | 256    |
| ------------------ | ------ | ------ | ------ | ------ |
| PaddlePaddle Fluid | 190.20 | 222.15 | 247.40 | 258.18 |
| PaddlePaddle v2    | 170.96 | 233.71 | 256.14 | 329.23 |
| TensorFlow         | -      | -      | -      | -      |

### Acceleration Rate

- PServer Count: 20
- Batch Size: 128
- Metrics: samples / sec

| Trainer Count                     | 20              | 40              | 80              | 100              |
| --------------------------------- | --------------- | --------------- | --------------- | ---------------- |
| PaddlePaddle Fluid                | 263.29 (78.64%) | 518.80 (77.47%) | 836.26 (62.44%) | 1019.29 (60.89%) |
| PaddlePaddle v2 (need more tests) | 326.85 (92.85%) | 534.58 (75.93%) | 853.30 (60.60%) | 1041.99 (59.20%) |
| TensorFlow                        | -               | -               | -               | -                |

### Different PServer Count

- Trainer Count: 60
- Batch Size: 128
- Metrics: samples / sec

| PServer Count                              | 3     | 6     | 10    | 20    |
| ------------------------------------------ | ----- | ----- | ----- | ----- |
| PaddlePaddle Fluid (should fix in next PR) | 589.1 | 592.6 | 656.4 | 655.8 |
| PaddlePaddle v2 (need more tests)          | 593.4 | 791.3 | 729.7 | 821.7 |
| TensorFlow                                 | -     | -     | -     | -     |

The performance gap between Fluid and v2 comes from network interference.

## Steps to Run the Performance Test

1. Re-compile PaddlePaddle with `-DWITH_DISTRIBUTE` enabled to build it with distributed support.
2. When the build finishes, copy the output `.whl` package located under `build/python/dist` to the current directory.
3. Run `docker build -t [image:tag] .` to build the Docker image, then run `docker push [image:tag]` to push the image to a registry so Kubernetes can pull it.
4. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your Kubernetes cluster (you must configure the kubectl client before this step).
5. Run `kubectl get po` to list the running pods, then run `kubectl logs [podID]` to fetch the logs of the pserver and trainer pods.

Check the logs for the distributed training progress and analyze the performance.
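The steps above can be chained into one script. This is only a minimal sketch, not the project's official tooling: the build directory layout and the registry/image tag (`your-registry/vgg16-dist:latest`) are placeholder assumptions, while the YAML files are the ones in this directory.

```bash
#!/bin/bash
set -e

# 1. Re-compile PaddlePaddle with distributed support (assumes an out-of-source build dir).
cd /path/to/Paddle/build
cmake .. -DWITH_DISTRIBUTE=ON
make -j"$(nproc)"

# 2. Copy the built wheel package next to this benchmark's Dockerfile.
cp python/dist/*.whl ../benchmark/cluster/vgg16/
cd ../benchmark/cluster/vgg16

# 3. Build and push the Docker image so Kubernetes can pull it (placeholder image name).
IMAGE=your-registry/vgg16-dist:latest
docker build -t "$IMAGE" .
docker push "$IMAGE"

# 4. Start the pserver and trainer jobs on the cluster.
kubectl create -f pserver.yaml
kubectl create -f trainer.yaml

# 5. List the running pods and fetch their logs.
kubectl get po
kubectl logs [podID]   # replace [podID] with an actual pserver/trainer pod name
```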

## Enable Verbose Logs

Edit `pserver.yaml` and `trainer.yaml` and add the environment variables `GLOG_v=3` and `GLOG_logtostderr=1` to see what happens in detail.
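For example, the container spec in each file could get an `env` section like this sketch (the container name is a placeholder; only the two GLOG variables come from the instructions above):

```yaml
spec:
  containers:
    - name: trainer            # placeholder container name
      env:
        - name: GLOG_v         # verbose logging level
          value: "3"
        - name: GLOG_logtostderr
          value: "1"
```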