Merge pull request #6674 from typhoonzero/refine_doc_structure
refine cluster doc structure
del_some_in_makelist
@@ -0,0 +1,42 @@
# Cluster Training Using Fabric

## Prepare a Linux cluster

Under the directory `paddle/scripts/cluster_train_v2/fabric/docker_cluster`, run `kubectl create -f ssh_servers.yaml` to launch a test cluster, and use `kubectl get po -o wide` to obtain the IP addresses of these nodes.
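Concretely, the two commands would be run roughly as follows (a minimal sketch, assuming `kubectl` is already configured against the target cluster):

```bash
# launch the test cluster described by ssh_servers.yaml
cd paddle/scripts/cluster_train_v2/fabric/docker_cluster
kubectl create -f ssh_servers.yaml

# list the pods together with the node IP addresses (the IP column)
kubectl get po -o wide
```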
## Launch the Cluster Job

`paddle.py` provides automated scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command-line options can be passed as `paddle.py` command options, and `paddle.py` will transparently and automatically forward them to the underlying PaddlePaddle processes.

`paddle.py` provides two dedicated command options to make job launching easier.

- `job_dispatch_package`: set it to a local `workspace` directory; it will be dispatched to all nodes configured in `conf.py`. This helps users who frequently modify and access workspace files, since repeatedly deploying the workspace to multiple nodes is otherwise tedious.
- `job_workspace`: set it to an already deployed workspace directory; `paddle.py` will then skip the dispatch stage and directly launch the cluster job on all nodes, which helps reduce dispatch latency.
`cluster_train/run.sh` provides a sample command line to run the `doc/howto/usage/cluster/src/word2vec` cluster job. Just modify `job_dispatch_package` and `job_workspace` to your own directories, then:

```bash
sh run.sh
```

The cluster job will start in a few seconds.
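For orientation, `run.sh` essentially wraps a `paddle.py` invocation. A hypothetical sketch (the paths and trainer options below are placeholders; only `job_dispatch_package` / `job_workspace` are the options described above):

```bash
# hypothetical sketch of run.sh; replace paths and trainer options with your own
python paddle.py \
  --job_dispatch_package="/path/to/local/workspace" \
  --config=./trainer_config.py \
  --trainer_count=2 \
  --num_passes=5 \
  --use_gpu=0
```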
## Kill the Cluster Job

`paddle.py` catches the `Ctrl + C` (SIGINT) signal and automatically kills all processes it launched, so simply interrupting `paddle.py` terminates the cluster job. You can also kill the job manually if the program crashes.

## Check the Cluster Training Result

Check the logs under `$workspace/log` for details; every node has the same log structure.

`paddle_trainer.INFO`

Provides almost all of the training's internal output logs, the same as in local training. Check runtime model convergence here.

`paddle_pserver2.INFO`

Provides the parameter server (pserver) running log, which helps diagnose distributed errors.

`server.log`

Provides the stderr and stdout of the parameter server process. Check this error log if training fails.

`train.log`

Provides the stderr and stdout of the trainer process. Check this error log if training fails.
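A quick way to inspect these logs might be the following (a sketch, assuming the log files sit directly under `$workspace/log` as described above):

```bash
# follow training progress and check model convergence
tail -n 50 $workspace/log/paddle_trainer.INFO

# look for crash diagnostics from the trainer and the parameter server
grep -i error $workspace/log/train.log $workspace/log/server.log
```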
## Check the Model Output

After the run finishes, the model files will be written to the `output` directory on node 0.

`nodefile` in the workspace indicates the node ID of the current cluster job.
@@ -0,0 +1,43 @@
# Cluster Training Using Fabric

## Prepare a Linux cluster

Running `kubectl create -f ssh_servers.yaml` under the directory `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get the IP addresses of these nodes.
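Put together, the preparation step looks roughly like this (a sketch, assuming `kubectl` already points at the demo cluster):

```bash
# launch the demo cluster described by ssh_servers.yaml
cd paddle/scripts/cluster_train_v2/fabric/docker_cluster
kubectl create -f ssh_servers.yaml

# list pods with their node IP addresses (the IP column)
kubectl get po -o wide
```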
## Launching Cluster Job

`paddle.py` provides automated scripts to start all PaddlePaddle cluster processes on different nodes. By default, all command-line options can be set as `paddle.py` command options, and `paddle.py` will transparently and automatically pass them down to the lower-level PaddlePaddle processes.

`paddle.py` provides two dedicated command options for easy job launching.

- `job_dispatch_package`: set it to a local `workspace` directory; it will be dispatched to all nodes configured in `conf.py`. This is helpful when you frequently manipulate workspace files, since repeated multi-node workspace deployment is otherwise tedious.
- `job_workspace`: set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and directly launch the cluster job on all nodes, which helps reduce the heavy dispatch latency.
`cluster_train/run.sh` provides a sample command line to run the `demo/recommendation` cluster job. Just modify `job_dispatch_package` and `job_workspace` to your own directories, then:

```bash
sh run.sh
```

The cluster job will start in several seconds.
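For reference, `run.sh` is essentially a thin wrapper around a `paddle.py` call. A hypothetical sketch (paths and trainer options are placeholders; only `job_dispatch_package` / `job_workspace` come from the options above):

```bash
# hypothetical sketch of run.sh; adapt paths and trainer options to your job
python paddle.py \
  --job_dispatch_package="/path/to/local/workspace" \
  --config=./trainer_config.py \
  --trainer_count=2 \
  --num_passes=5 \
  --use_gpu=0
```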
## Kill Cluster Job

`paddle.py` captures the `Ctrl + C` (SIGINT) signal to automatically kill all processes it launched, so just stop `paddle.py` to kill the cluster job. You should kill the job manually if the program crashes.

## Check Cluster Training Result

Check the logs in `$workspace/log` for details; each node has the same log structure.

`paddle_trainer.INFO`

Provides almost all of the internal output logs for training, the same as in local training. Check runtime model convergence here.

`paddle_pserver2.INFO`

Provides the parameter server running log, which helps diagnose distributed errors.

`server.log`

Provides the stderr and stdout of the parameter server process. Check this error log if training crashes.

`train.log`

Provides the stderr and stdout of the trainer process. Check this error log if training crashes.
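For example, a minimal inspection of these logs could look like this (a sketch, assuming the files sit directly under `$workspace/log` as described above):

```bash
# follow training progress and check model convergence
tail -n 50 $workspace/log/paddle_trainer.INFO

# look for crash diagnostics from the trainer and the parameter server
grep -i error $workspace/log/train.log $workspace/log/server.log
```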
## Check Model Output

After one pass has finished, the model files will be written to the `output` directory on node 0.

`nodefile` in the workspace indicates the node ID of the current cluster job.
@@ -0,0 +1 @@
k8s_aws_en.md
@@ -0,0 +1,41 @@
# Cluster Training Using OpenMPI

## Prepare an OpenMPI cluster

Run the following commands to start a 3-node MPI cluster and one "head" node:
```bash
cd paddle/scripts/cluster_train_v2/openmpi/docker_cluster
kubectl create -f head.yaml
kubectl create -f mpi-nodes.yaml
```
Then you can log in to every OpenMPI node over ssh without entering any passwords.
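A quick way to verify the passwordless login (a sketch; `[nodeIP]` is one of the addresses shown by `kubectl get po -o wide`, and the key path mirrors the commands below):

```bash
# should print the remote hostname without prompting for a password
ssh -i ssh/id_rsa.mpi.pub tutorial@[nodeIP] hostname
```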
## Launching Cluster Job

Follow these steps to launch a PaddlePaddle training job on the OpenMPI cluster:
```bash
# find out node IP addresses
kubectl get po -o wide
# generate a "machines" file containing node IP addresses
kubectl get po -o wide | grep nodes | awk '{print $6}' > machines
# copy necessary files onto the "head" node
scp -i ssh/id_rsa.mpi.pub machines prepare.py train.py start_mpi_train.sh tutorial@[headIP]:~
# log in to the head node using ssh
ssh -i ssh/id_rsa.mpi.pub tutorial@[headIP]
# --------------- in head node ---------------
# prepare training data
python prepare.py
# copy training data and dict file to MPI nodes
cat machines | xargs -i scp word_dict.pickle train.py start_mpi_train.sh machines {}:/home/tutorial
# create a directory for storing log files
mpirun -hostfile machines -n 3 mkdir /home/tutorial/logs
# copy training data to every node
scp train.txt-00000 test.txt-00000 [node1IP]:/home/tutorial
scp train.txt-00001 test.txt-00001 [node2IP]:/home/tutorial
scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial
# start the job
mpirun -hostfile machines -n 3 /home/tutorial/start_mpi_train.sh
```