
Performance for Distributed vgg16

Test Result

Hardware Information

  • CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  • CPU MHz: 2101.000
  • Cache size: 20480 KB

BLAS Settings

Set the environment variable `MKL_NUM_THREADS=1`.
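
For a local single-node run, this can be exported in the shell before launching the trainer (a minimal sketch; how the trainer itself is launched depends on your setup):

```bash
# Pin MKL to a single thread so the single-node benchmark measures one core.
export MKL_NUM_THREADS=1
```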

Single Node Single Thread

  • Metrics: samples / sec

| Batch Size | 32 | 64 | 128 | 256 |
| --- | --- | --- | --- | --- |
| PaddlePaddle Fluid | 15.44 | 16.32 | 16.74 | 16.79 |
| PaddlePaddle v2 | 15.97 | 17.04 | 17.60 | 17.83 |
| TensorFlow | 9.09 | 9.10 | 9.24 | 8.66 |

Different Batch Size

  • PServer Count: 10
  • Trainer Count: 20
  • Metrics: samples / sec

| Batch Size | 32 | 64 | 128 | 256 |
| --- | --- | --- | --- | --- |
| PaddlePaddle Fluid | 190.20 | 222.15 | 247.40 | 258.18 |
| PaddlePaddle v2 | 170.96 | 233.71 | 256.14 | 329.23 |
| TensorFlow | - | - | - | - |

Acceleration Rate

  • PServer Count: 20
  • Batch Size: 128
  • Metrics: samples / sec

| Trainer Count | 20 | 40 | 80 | 100 |
| --- | --- | --- | --- | --- |
| PaddlePaddle Fluid | 263.29 (78.64%) | 518.80 (77.47%) | 836.26 (62.44%) | 1019.29 (60.89%) |
| PaddlePaddle v2 (need more tests) | 326.85 (92.85%) | 534.58 (75.93%) | 853.30 (60.60%) | 1041.99 (59.20%) |
| TensorFlow | - | - | - | - |
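
The percentages in parentheses appear to be scaling efficiency relative to the single-node single-thread throughput at the same batch size, i.e. distributed throughput / (trainer count × single-thread throughput); for example, Fluid with 20 trainers at batch size 128 gives 263.29 / (20 × 16.74) ≈ 78.64%.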

Different PServer Count

  • Trainer Count: 60
  • Batch Size: 128
  • Metrics: samples / sec

| PServer Count | 3 | 6 | 10 | 20 |
| --- | --- | --- | --- | --- |
| PaddlePaddle Fluid (should fix in next PR) | 589.1 | 592.6 | 656.4 | 655.8 |
| PaddlePaddle v2 (need more tests) | 593.4 | 791.3 | 729.7 | 821.7 |
| TensorFlow | - | - | - | - |

The performance gap between Fluid and v2 comes from network interference.

Steps to Run the Performance Test

  1. Re-compile PaddlePaddle with `-DWITH_DISTRIBUTE` enabled to build it with distributed support.
  2. When the build finishes, copy the output whl package located under `build/python/dist` to the current directory.
  3. Run `docker build -t [image:tag] .` to build the docker image, and run `docker push [image:tag]` to push the image to a repository so Kubernetes can pull it.
  4. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your Kubernetes cluster (the kubectl client must be configured before this step).
  5. Run `kubectl get po` to list the running pods, then run `kubectl logs [podID]` to fetch the pod logs of the pservers and trainers (a condensed sketch of these commands is shown below).

Check the logs for the distributed training progress and analyze the performance.
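
A condensed sketch of the steps above, assuming the Fluid job specs in this directory (fluid_pserver.yaml and fluid_trainer.yaml); the image tag, pod name, and exact cmake invocation are placeholders to adapt to your environment:

```bash
# 1. Re-compile PaddlePaddle with distributed support (run from the Paddle build
#    directory; the exact cmake options depend on your existing build configuration).
cmake .. -DWITH_DISTRIBUTE=ON && make -j

# 2. Copy the built wheel into this benchmark directory, next to the Dockerfile.
cp build/python/dist/*.whl .

# 3. Build the benchmark image and push it to a registry Kubernetes can reach.
docker build -t <registry>/vgg16-dist:latest .
docker push <registry>/vgg16-dist:latest

# 4. Start the pserver and trainer jobs on the cluster.
kubectl create -f fluid_pserver.yaml
kubectl create -f fluid_trainer.yaml

# 5. List the running pods and fetch a pod's log.
kubectl get po
kubectl logs <pod-name>
```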

Enable Verbose Logs

Edit pserver.yaml and trainer.yaml and add the environment variables `GLOG_v=3` and `GLOG_logtostderr=1` to see what happens in detail.
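
Inside the cluster these belong under the container's `env:` section of the two YAML files; for a quick local debugging run outside Kubernetes they can simply be exported in the shell (a minimal sketch):

```bash
# Verbose glog output at level 3, written to stderr instead of separate log files.
export GLOG_v=3
export GLOG_logtostderr=1
```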