@@ -1,7 +0,0 @@
FROM paddledev/paddle:cpu-latest

MAINTAINER zjsxzong89@gmail.com

COPY start.sh /root/
COPY start_paddle.py /root/
CMD ["bash", "-c", "/root/start.sh"]
(Four binary images deleted: 116 KiB, 236 KiB, 225 KiB, 501 KiB.)
@@ -1,7 +0,0 @@
FROM alpine

# coreutils provides GNU split, which get_data.sh needs for --number=l/N
RUN apk update && apk upgrade && apk add coreutils
ADD quick_start /quick_start
ADD get_data.sh /bin/
RUN chmod +x /bin/get_data.sh
ENTRYPOINT ["/bin/get_data.sh"]
@@ -1,6 +0,0 @@
To build the PaddlePaddle data preparation image used in the tutorial [Distributed PaddlePaddle Training on AWS with Kubernetes](../../k8s_aws_en.md), run the following commands:

```
cp -r ../../../../../../demo/quick_start .
docker build . -t prepare-data-image-name
```
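The data preparation image is configured entirely through environment variables: get_data.sh (below) reads `OUT_DIR` and `SPLIT_COUNT`. A minimal sketch of running the built image, assuming the shared volume is mounted at /efs (the mount point and values here are illustrative, not from the tutorial):

```
docker run --rm \
  -e OUT_DIR=/efs/paddle-cluster-job \
  -e SPLIT_COUNT=3 \
  -v /efs:/efs \
  prepare-data-image-name
```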
@@ -1,26 +0,0 @@
#!/bin/sh

# OUT_DIR: destination directory for the prepared data
# SPLIT_COUNT: number of training nodes to shard train.txt across
out_dir=$OUT_DIR
split_count=$SPLIT_COUNT

set -e

mkdir -p $out_dir
cp -r /quick_start $out_dir/

# download and unpack the preprocessed quick_start data into node 0's directory
mkdir -p $out_dir/0/data
cd $out_dir/0/data
wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz
tar zxvf preprocessed_data.tar.gz
rm preprocessed_data.tar.gz

# shard train.txt into $split_count pieces without splitting lines;
# node 0 keeps the first shard
split -d --number=l/$split_count -a 5 train.txt train.
mv train.00000 train.txt

# replicate node 0's data directory for nodes 1..N-1, each taking its own shard
cd $out_dir
end=$(expr $split_count - 1)
for i in $(seq 1 $end); do
    mkdir -p $i/data
    cp -r 0/data/* $i/data
    mv $i/data/train.`printf %05d $i` $i/data/train.txt
done
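The sharding step relies on GNU split from coreutils (hence `apk add coreutils` in the Dockerfile above): `--number=l/N` divides a file into N chunks without breaking lines, and `-d -a 5` produces five-digit numeric suffixes. A quick sketch of the behavior with a split count of 3:

```
$ seq 1 9 > train.txt
$ split -d --number=l/3 -a 5 train.txt train.
$ ls train.0*
train.00000  train.00001  train.00002
$ cat train.00001
4
5
6
```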
@@ -1,6 +0,0 @@
FROM paddledev/paddle:cpu-latest

COPY start.sh /root/
COPY start_paddle.py /root/
RUN chmod +x /root/start.sh
CMD ["bash", "-c", "/root/start.sh"]
@@ -1,5 +0,0 @@
To build the PaddlePaddle training image used in the tutorial [Distributed PaddlePaddle Training on AWS with Kubernetes](../../k8s_aws_en.md), run the following command:

```
docker build . -t train-image-name
```
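Before the Kubernetes job can pull this image, it has to be pushed to a registry the cluster can reach; a sketch, assuming a hypothetical Docker Hub account named your-hub-user:

```
docker tag train-image-name your-hub-user/train-image-name
docker push your-hub-user/train-image-name
```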
@@ -1,19 +0,0 @@
#!/bin/sh

set -eu

# copy the job's training config (e.g. trainer_config.lr.py) from the shared volume
jobconfig="${JOB_PATH}/${JOB_NAME}/${TRAIN_CONFIG_DIR}"
cd /root
cp -rf $jobconfig/* .

python /root/start_paddle.py \
    --dot_period=10 \
    --ports_num=$CONF_PADDLE_PORTS_NUM \
    --ports_num_for_sparse=$CONF_PADDLE_PORTS_NUM_SPARSE \
    --log_period=50 \
    --num_passes=10 \
    --trainer_count=$TRAINER_COUNT \
    --saving_period=1 \
    --local=0 \
    --config=trainer_config.lr.py \
    --use_gpu=0
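start.sh above and start_paddle.py (below) are driven entirely by environment variables, which the tutorial's Kubernetes job spec injects into each pod. For reference, these are the variables the two scripts read; every value shown is illustrative only, not taken from the source:

```
JOB_PATH=/home/jobpath              # shared volume holding job config and data
JOB_NAME=paddle-cluster-job         # Kubernetes job name, used as the label selector
JOB_NAMESPACE=default
TRAIN_CONFIG_DIR=quick_start
TRAINER_COUNT=4
CONF_PADDLE_NIC=eth0
CONF_PADDLE_PORT=7164
CONF_PADDLE_PORTS_NUM=2
CONF_PADDLE_PORTS_NUM_SPARSE=2
CONF_PADDLE_GRADIENT_NUM=3          # read as PADDLE_SERVER_NUM in start_paddle.py
```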
@@ -1,170 +0,0 @@
#!/usr/bin/python
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import requests
import time
import socket
import os
import argparse

# configuration for cluster
API = "/api/v1/namespaces/"
JOBSELECTOR = "labelSelector=job-name="
JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME")
JOB_PATH_OUTPUT = JOB_PATH + "/output"
JOBNAME = os.getenv("JOB_NAME")
NAMESPACE = os.getenv("JOB_NAMESPACE")
PADDLE_NIC = os.getenv("CONF_PADDLE_NIC")
PADDLE_PORT = os.getenv("CONF_PADDLE_PORT")
PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM")
PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE")
PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM")

tokenpath = '/var/run/secrets/kubernetes.io/serviceaccount/token'


def refine_unknown_args(cmd_args):
    '''
    Refine unknown parameters: rewrite "--key=value" and "--key value"
    flags into a flat [key, value, ...] list.
    '''
    new_args = []
    for arg in cmd_args:
        if arg.startswith("--") and arg.find("=") != -1:
            equal_pos = arg.find("=")  # find first = pos
            arglist = list(arg)
            arglist[equal_pos] = " "
            arg = "".join(arglist)
            arg = arg.lstrip("-")
            new_args += arg.split(" ")
        elif arg.startswith("--") and arg.find("=") == -1:
            arg = arg.lstrip("-")
            new_args.append(arg)
        else:
            new_args.append(arg)
    return new_args


def isPodAllRunning(podlist):
    '''
    Check whether all pods of the job are running.
    '''
    require = len(podlist["items"])
    running = 0
    for pod in podlist["items"]:
        if pod["status"]["phase"] == "Running":
            running += 1
    print "waiting for pods running, require:", require, "running:", running
    if require == running:
        return True
    return False


def getPodList():
    '''
    Get the status of all pods of the job from the Kubernetes API server.
    '''
    apiserver = "https://" + \
        os.getenv("KUBERNETES_SERVICE_HOST") + ":" + \
        os.getenv("KUBERNETES_SERVICE_PORT_HTTPS")

    pod = API + NAMESPACE + "/pods?"
    job = JOBNAME
    if os.path.isfile(tokenpath):
        tokenfile = open(tokenpath, mode='r')
        token = tokenfile.read()
        Bearer = "Bearer " + token
        headers = {"Authorization": Bearer}
        return requests.get(apiserver + pod + JOBSELECTOR + job,
                            headers=headers,
                            verify=False).json()
    else:
        return requests.get(apiserver + pod + JOBSELECTOR + job,
                            verify=False).json()


def getIdMap(podlist):
    '''
    Generate a trainer_id for each pod by sorting the pod IPs.
    '''
    ips = []
    for pod in podlist["items"]:
        ips.append(pod["status"]["podIP"])
    ips.sort()
    idMap = {}
    for i in range(len(ips)):
        idMap[ips[i]] = i
    return idMap


def startPaddle(idMap={}, train_args_dict=None):
    '''
    Start the paddle pserver and trainer on this node.
    '''
    program = 'paddle train'
    args = " --nics=" + PADDLE_NIC
    args += " --port=" + str(PADDLE_PORT)
    args += " --ports_num=" + str(PADDLE_PORTS_NUM)
    args += " --comment=" + "paddle_process_by_paddle"
    ip_string = ""
    for ip in idMap.keys():
        ip_string += (ip + ",")
    ip_string = ip_string.rstrip(",")
    args += " --pservers=" + ip_string
    args_ext = ""
    for key, value in train_args_dict.items():
        args_ext += (' --' + key + '=' + value)
    localIP = socket.gethostbyname(socket.gethostname())
    trainerId = idMap[localIP]
    args += " " + args_ext + " --trainer_id=" + \
        str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT
    logDir = JOB_PATH_OUTPUT + "/node_" + str(trainerId)
    if not os.path.exists(JOB_PATH_OUTPUT):
        os.makedirs(JOB_PATH_OUTPUT)
    if not os.path.exists(logDir):
        os.mkdir(logDir)
    copyCommand = 'cp -rf ' + JOB_PATH + \
        "/" + str(trainerId) + "/data/*" + " ./data/"
    os.system(copyCommand)
    startPserver = 'nohup paddle pserver' + \
        " --port=" + str(PADDLE_PORT) + \
        " --ports_num=" + str(PADDLE_PORTS_NUM) + \
        " --ports_num_for_sparse=" + str(PADDLE_PORTS_NUM_SPARSE) + \
        " --nics=" + PADDLE_NIC + \
        " --comment=" + "paddle_process_by_paddle" + \
        " --num_gradient_servers=" + str(PADDLE_SERVER_NUM) + \
        " > " + logDir + "/server.log 2>&1 &"
    print startPserver
    os.system(startPserver)
    # wait until pservers completely start
    time.sleep(20)
    startTrainer = program + args + " 2>&1 | tee " + \
        logDir + "/train.log"
    print startTrainer
    os.system(startTrainer)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        prog="start_paddle.py", description='simple tool for k8s')
    args, train_args_list = parser.parse_known_args()
    train_args = refine_unknown_args(train_args_list)
    train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2]))
    podlist = getPodList()
    # need to wait until all pods are running
    while not isPodAllRunning(podlist):
        time.sleep(20)
        podlist = getPodList()
    idMap = getIdMap(podlist)
    startPaddle(idMap, train_args_dict)
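To trace the argument flow in start_paddle.py: the argparse parser defines no options of its own, so parse_known_args leaves every flag from start.sh in train_args_list; refine_unknown_args flattens them, dict(zip(...)) pairs them up, and startPaddle re-expands them onto the `paddle train` command line. A sketch of one flag's journey:

```
# start.sh invokes:          start_paddle.py --num_passes=10 ...
# refine_unknown_args gives: ['num_passes', '10', ...]
# train_args_dict becomes:   {'num_passes': '10', ...}
# paddle train receives:     --num_passes=10 ... --trainer_id=<id> --save_dir=<output>
```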
(Five binary images deleted: 242 KiB, 70 KiB, 35 KiB, 51 KiB, 87 KiB.)