refine structure of cluster and quick start

emailweixu-patch-1
Luo Tao 7 years ago
parent b1869f1695
commit 87e3bdac4e

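@@ -1,6 +1,9 @@
Quick Start
===========
Quick Install
-------------
PaddlePaddle can be installed quickly with pip. It currently supports CentOS 6 or later, Ubuntu 14.04, and MacOS 10.12, with Python 2.7 installed.
Run the following command to complete the quick installation; the version is cpu_avx_openblas: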
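@@ -16,6 +19,9 @@ PaddlePaddle can be installed quickly with pip. It currently supports CentOS 6 or later, Ubuntu 14.
For more detailed installation and build instructions, see :ref:`install_steps`.
Quick Use
---------
Create a file named housing.py and paste in this Python code: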
.. code-block:: python

@@ -1,6 +1,9 @@
Quick Start
============
Quick Install
-------------
You can install PaddlePaddle with a single pip command. Supported platforms are
CentOS 6 or above, Ubuntu 14.04 or above, and MacOS 10.12, with Python 2.7 installed.
Run the following command to install the cpu_avx_openblas version:
@@ -17,6 +20,9 @@ If you need to install GPU version (cuda7.5_cudnn5_avx_openblas), run:
For more details about installation and building, see :ref:`install_steps`.
Quick Use
---------
Create a new file called housing.py, and paste this Python
code:

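@@ -1,10 +1,22 @@
Distributed Training
====================

This section describes how to use PaddlePaddle for distributed training on different cluster frameworks. The distributed training architecture is shown in the figure below:

.. image:: src/ps_cn.png
    :width: 500

- Data shard: the data used to train the neural network is split into multiple parts, and each part is given to one trainer.
- Trainer: after starting, each trainer reads its part of the split data, begins the forward and backward computation of the neural network, and communicates with the parameter servers. After training on a certain amount of data, it uploads the computed gradients and then downloads the optimized, updated neural network parameters.
- Parameter server: each parameter server stores only a part of all the parameters of the whole neural network. Parameter servers receive the gradients uploaded by the trainers, perform the parameter optimization update, and then send the updated parameters down to each trainer.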
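
The trainer and parameter server roles described above can be sketched in a few lines of plain Python. This is only an illustration and not PaddlePaddle's API: the `ParameterServer` class, its `push_gradients` and `pull_parameters` methods, and the `train_on_shard` helper are hypothetical names, and the "network" is assumed to be a single weight fitted to y = w * x.

```python
class ParameterServer:
    """Holds one slice of the model and applies plain SGD updates (illustrative only)."""

    def __init__(self, params, lr=0.1):
        self.params = dict(params)
        self.lr = lr

    def push_gradients(self, grads):
        # Optimize only the parameters this server owns.
        for name, grad in grads.items():
            self.params[name] -= self.lr * grad

    def pull_parameters(self):
        # Return a copy so trainers cannot modify server state directly.
        return dict(self.params)


def train_on_shard(server, shard, passes=200):
    """Trainer loop: download parameters, compute a gradient on the local shard, upload it."""
    for _ in range(passes):
        w = server.pull_parameters()["w"]
        # Forward and backward pass for y = w * x with squared error.
        grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        server.push_gradients({"w": grad})
    return server.pull_parameters()


# Toy shard generated from y = 3 * x.
server = ParameterServer({"w": 0.0}, lr=0.1)
final = train_on_shard(server, shard=[(1.0, 3.0), (2.0, 6.0)])
```

In a real job several trainers would talk to several servers at once; a single trainer and server are used here only to keep the upload and download cycle visible.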
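
In this way, through the distributed cooperation of trainers and parameter servers, a neural network can be trained with the SGD method. PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.

When training a neural network with synchronous SGD, PaddlePaddle uses a synchronization barrier so that gradient submission and parameter updates are executed in order. Asynchronous SGD does not wait for all trainers to submit their gradients before updating parameters, which greatly increases the parallelism of the computation: parameter servers do not depend on each other and receive gradients and update parameters in parallel; a parameter server does not wait for every trainer to submit its gradients before starting the next step; and trainers do not depend on each other and train the model in parallel. Although asynchronous SGD increases the parallelism of parameter updates, it cannot guarantee that parameters are updated in sync: at any given time, the parameters stored on one parameter server may be newer than those on another, so the gradients are noisier than with synchronous SGD.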

.. toctree::
  :maxdepth: 1

  introduction_cn.md
  preparations_cn.md
  cmd_argument_cn.md
  multi_cluster/index_cn.rst

@@ -1,10 +1,22 @@
Distributed Training
====================

In this section, we'll explain how to run distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the main architecture of a distributed training job:

.. image:: src/ps_en.png
    :width: 500

- Data shard: training data is split into multiple partitions, and each trainer uses one partition of the whole dataset for its training job.
- Trainer: each trainer reads its data shard and trains the neural network. The trainer then uploads the computed gradients to the parameter servers and waits for the parameters to be optimized on the parameter server side. When that finishes, the trainer downloads the optimized parameters and continues training.
- Parameter server: every parameter server stores part of the whole neural network model data. When trainers upload gradients, the parameter servers run the optimization calculations and then send the updated parameters to the trainers.
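
As a rough illustration of the two kinds of partitioning in the list above, the sketch below splits a dataset across trainers and assigns model parameters to parameter servers. The helper names `shard_data` and `partition_parameters` are hypothetical, and the round-robin policy is an assumption made for the example; the text does not say how PaddlePaddle actually chooses the split.

```python
def shard_data(samples, num_trainers):
    """Split samples into one shard per trainer (round-robin, assumed policy)."""
    return [samples[i::num_trainers] for i in range(num_trainers)]


def partition_parameters(param_names, num_pservers):
    """Assign each parameter to exactly one server, so every server holds part of the model."""
    assignment = {server: [] for server in range(num_pservers)}
    for index, name in enumerate(param_names):
        assignment[index % num_pservers].append(name)
    return assignment


# Ten samples shared by three trainers; five parameters spread over two servers.
shards = shard_data(list(range(10)), num_trainers=3)
placement = partition_parameters(["w1", "b1", "w2", "b2", "w3"], num_pservers=2)
```

Every sample lands in exactly one shard and every parameter on exactly one server, which is the property the architecture relies on.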

PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.

When training with synchronous SGD, PaddlePaddle uses an internal synchronization barrier that keeps gradient updates and parameter downloads in strict order. Asynchronous SGD, by contrast, does not wait for all trainers to finish uploading at each step, which increases the parallelism of distributed training: parameter servers do not depend on each other and optimize parameters concurrently, and because parameter servers do not wait for trainers, trainers also work concurrently. The trade-off is that asynchronous SGD introduces more randomness and noise into the gradients.
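
The difference between the two modes can be simulated sequentially in plain Python. The sketch below is a toy model under stated assumptions, not PaddlePaddle code: the function names are hypothetical, the model is a single weight fitted to y = w * x, and "asynchrony" is imitated by letting each trainer compute its gradient on the copy of the weight it last downloaded.

```python
def shard_gradient(w, shard):
    """Mean squared-error gradient for the model y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_sgd(shards, w=0.0, lr=0.02, steps=100):
    """Synchronous SGD: wait for every trainer, then update once per step."""
    for _ in range(steps):
        # Barrier: all gradients are computed against the same parameters.
        grads = [shard_gradient(w, shard) for shard in shards]
        w -= lr * sum(grads) / len(grads)
    return w


def async_sgd(shards, w=0.0, lr=0.02, steps=100):
    """Asynchronous SGD: apply each gradient as soon as it arrives."""
    local_w = [w] * len(shards)  # each trainer's last downloaded copy
    for _ in range(steps):
        for i, shard in enumerate(shards):
            # The local copy may be stale: other trainers have updated w since.
            grad = shard_gradient(local_w[i], shard)
            w -= lr * grad      # server updates without waiting for the others
            local_w[i] = w      # trainer downloads the latest parameters
    return w


# Toy data generated from y = 2 * x, split across two trainers.
shards = [[(1.0, 2.0), (3.0, 6.0)], [(2.0, 4.0), (4.0, 8.0)]]
w_sync = sync_sgd(shards)
w_async = async_sgd(shards)
```

After a single step the two variants already disagree, because the asynchronous trainers apply gradients computed on parameters that have since changed; on this toy problem both still settle near w = 2.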

.. toctree::
  :maxdepth: 1

  introduction_en.md
  preparations_en.md
  cmd_argument_en.md
  multi_cluster/index_en.rst

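@@ -1,13 +0,0 @@
## Overview
This section describes how to use PaddlePaddle for distributed training on different cluster frameworks. The distributed training architecture is shown in the figure below:
<img src="https://user-images.githubusercontent.com/13348433/31772175-5f419eca-b511-11e7-9db7-5231fe3d9ccb.png" width="500">
- Data shard: the data used to train the neural network is split into multiple parts, and each part is given to one trainer.
- Trainer: after starting, each trainer reads its part of the split data, begins the forward and backward computation of the neural network, and communicates with the parameter servers. After training on a certain amount of data, it uploads the computed gradients and then downloads the optimized, updated neural network parameters.
- Parameter server: each parameter server stores only a part of all the parameters of the whole neural network. Parameter servers receive the gradients uploaded by the trainers, perform the parameter optimization update, and then send the updated parameters down to each trainer.
In this way, through the distributed cooperation of trainers and parameter servers, a neural network can be trained with the SGD method. PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.
When training a neural network with synchronous SGD, PaddlePaddle uses a synchronization barrier so that gradient submission and parameter updates are executed in order. Asynchronous SGD does not wait for all trainers to submit their gradients before updating parameters, which greatly increases the parallelism of the computation: parameter servers do not depend on each other and receive gradients and update parameters in parallel; a parameter server does not wait for every trainer to submit its gradients before starting the next step; and trainers do not depend on each other and train the model in parallel. Although asynchronous SGD increases the parallelism of parameter updates, it cannot guarantee that parameters are updated in sync: at any given time, the parameters stored on one parameter server may be newer than those on another, so the gradients are noisier than with synchronous SGD.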

@@ -1,13 +0,0 @@
## Introduction
In this section, we'll explain how to run distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the main architecture of a distributed training job:
<img src="https://user-images.githubusercontent.com/13348433/31772146-41523d84-b511-11e7-8a12-a69fd136c283.png" width="500">
- Data shard: training data is split into multiple partitions, and each trainer uses one partition of the whole dataset for its training job.
- Trainer: each trainer reads its data shard and trains the neural network. The trainer then uploads the computed gradients to the parameter servers and waits for the parameters to be optimized on the parameter server side. When that finishes, the trainer downloads the optimized parameters and continues training.
- Parameter server: every parameter server stores part of the whole neural network model data. When trainers upload gradients, the parameter servers run the optimization calculations and then send the updated parameters to the trainers.
PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.
When training with synchronous SGD, PaddlePaddle uses an internal synchronization barrier that keeps gradient updates and parameter downloads in strict order. Asynchronous SGD, by contrast, does not wait for all trainers to finish uploading at each step, which increases the parallelism of distributed training: parameter servers do not depend on each other and optimize parameters concurrently, and because parameter servers do not wait for trainers, trainers also work concurrently. The trade-off is that asynchronous SGD introduces more randomness and noise into the gradients.

Binary file not shown. Size: 33 KiB

Binary file not shown. Size: 142 KiB
