Fluid distributed training perf test

Steps to get started

  1. Recompile PaddlePaddle with -DWITH_DISTRIBUTE enabled to build it with distributed support.
  2. When the build finishes, copy the output whl package from build/python/dist to the current directory.
  3. Run docker build -t [image:tag] . to build the Docker image, then run docker push [image:tag] to push it to a repository where Kubernetes can find it.
  4. Run kubectl create -f pserver.yaml && kubectl create -f trainer.yaml to start the job on your Kubernetes cluster (the kubectl client must be configured before this step).
  5. Run kubectl get po to list the running pods, and run kubectl logs [podID] to fetch the logs of the pserver and trainer pods.
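
Steps 3 to 5 above can be collected into a small script. The following is only a sketch, not part of the benchmark: the IMAGE and DRY_RUN variables and the function names are made up for illustration, and IMAGE stands in for the [image:tag] placeholder. By default the script only prints the commands it would run, so it is safe to try without Docker or a cluster.

```shell
#!/bin/sh
# Sketch of steps 3 to 5. Variable and function names are illustrative only.
# IMAGE stands in for the [image:tag] placeholder; set it to your own value.
IMAGE="${IMAGE:-[image:tag]}"
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 3: build the image and push it so Kubernetes can pull it.
build_and_push() {
    run docker build -t "$IMAGE" . &&
    run docker push "$IMAGE"
}

# Step 4: create the pserver resources, then the trainer resources.
start_job() {
    run kubectl create -f pserver.yaml &&
    run kubectl create -f trainer.yaml
}

# Step 5: list the pods; pass a pod ID to fetch that pod's log instead.
show_pods_or_log() {
    if [ -n "$1" ]; then
        run kubectl logs "$1"
    else
        run kubectl get po
    fi
}

build_and_push && start_job && show_pods_or_log
```

To execute the commands for real, set IMAGE to your image name and DRY_RUN=0 before running the script, then call show_pods_or_log [podID] for each pod whose log you want to read.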

Check the logs to follow the distributed training progress and analyze the performance.

Enable verbose logs

Edit pserver.yaml and trainer.yaml and add the environment variable GLOG_v=3 to see what happens in detail.
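
In a Kubernetes container spec, an environment variable is normally declared as an entry in the container's env list. The fragment below is a sketch of what that entry could look like. Where exactly it goes depends on the container definitions in pserver.yaml and trainer.yaml, which are not reproduced here, so merge it into any existing env list rather than adding a second one.

```yaml
# Assumed placement: under the container entry in the pod spec.
env:
  - name: GLOG_v
    value: "3"   # env values must be strings, so keep the quotes
```

After editing the files, the job has to be recreated for the new variable to take effect.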