@ -5,9 +5,11 @@ In a large scale machine learning setup where the size of the training data is h
Polyak and Juditsky (1992) showed that the test performance of simple average of parameters obtained by Stochastic Gradient Descent (SGD) is as good as that of parameter values that are obtained by training the model over and over again, over the training dataset.
Hence, to accelerate the speed of Stochastic Gradient Descent, Averaged Stochastic Gradient Descent (ASGD) was proposed in Polyak and Juditsky (1992). For ASGD, the running average of parameters obtained by SGD, is used as the estimator for <imgsrc="./images/theta_star.gif"/><br/> . The averaging is done as follows:
Hence, to accelerate the speed of Stochastic Gradient Descent, Averaged Stochastic Gradient Descent (ASGD) was proposed in Polyak and Juditsky (1992). For ASGD, the running average of parameters obtained by SGD, is used as the estimator for <imgsrc="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/fluid/images/theta_star.gif"/><br/> . The averaging is done as follows:
@ -14,11 +14,29 @@ In programming languages, a block is a pair of curly braces that includes local
Blocks work with control flow structures like `if`, `else`, and `for`, which have equivalents in deep learning:
| programming languages | PaddlePaddle |
|-----------------------|-----------------------|
| for, while loop | RNN, WhileOp |
| if, if-else, switch | IfElseOp, SwitchOp |
| sequential execution | a sequence of layers |
<table>
<thead>
<tr>
<th>programming languages</th>
<th>PaddlePaddle</th>
</tr>
</thead>
<tbody>
<tr>
<td>for, while loop </td>
<td>RNN, WhileOp </td>
</tr>
<tr>
<td>if, if-else, switch </td>
<td>IfElseOp, SwitchOp </td>
</tr>
<tr>
<td>sequential execution </td>
<td>a sequence of layers </td>
</tr>
</tbody>
</table>
A key difference is that a C++ program describes a one pass computation, whereas a deep learning program describes both the forward and backward passes.
@ -26,12 +44,33 @@ A key difference is that a C++ program describes a one pass computation, whereas
The existence of the backward pass makes the execution of a block of PaddlePaddle different from traditional programs:
We'd like to have Python bindings to operators in package `paddle.operator`, and Python compositions of operators in package `paddle.layer`. So we have the following concepts in above illustrative example:
Like other deep learning systems, PaddlePaddle supports training models from sequence data. Also, like other systems, PaddlePaddle represent a mini-batch of sequences as a Tensor. What is different is that PaddlePaddle doesn't require all sequences in a mini-batch to be of the same length. Thus no need for padding zeros.
PaddlePaddle achieves this flexibility by passing through a new data type, *LoD Tensor*, which is a Tensor attached with segmentation index known as *LoD*, between operators. The LoD index doesn't only segment a tensor, but also recursively segments sub-sequences. This document presents the design of LoD and LoDTensor.
@ -10,10 +10,27 @@ PaddlePaddle uses proto message to describe compile time program because :
The computation `Program` consists of nested `Blocks`. Each `Block` will consist of data(i.e. `Variable`) and `Operations`. The concept to represent them is in the table below.
@ -40,11 +40,11 @@ computation is only specified in Python code which sits outside of PaddlePaddle,
Similar to how a compiler uses an intermediate representation (IR) so that the programmer does not need to manually optimize their code for most of the cases, we can have an intermediate representation in PaddlePaddle as well. The compiler optimizes the IR as follows:
PaddlePaddle can support model parallelism by converting the IR so that the user no longer needs to manually perform the computation and operations in the Python component:
The IR for PaddlePaddle after refactoring is called a `Block`, it specifies the computation dependency graph and the variables used in the computation.
@ -60,7 +60,7 @@ For a detailed explanation, refer to this document -
The revamped distributed training architecture can address the above discussed limitations. Below is the illustration of how it does so:
@ -171,7 +171,7 @@ In the future, a more general placement algorithm should be implemented, which m
The local training architecture will be the same as the distributed training architecture, the difference is that everything runs locally, and there is just one PaddlePaddle runtime:
The above diagram shows an RNN unrolled into a full network.
@ -22,7 +22,7 @@ There are several important concepts here:
There could be local variables defined in each step-net. PaddlePaddle runtime realizes these variables in *step-scopes* which are created for each step.
@ -93,7 +93,7 @@ For example, we could have a 2-level RNN, where the top level corresponds to par
The following figure illustrates feeding in text into the lower level, one sentence at a step, and the feeding in step outputs to the top level. The final top level output is about the whole text.
cudnn provides APIs to finish the whole series of computation, we can use them in our GPU kernel.
@ -124,7 +124,7 @@ for pass_id in range(PASS_NUM):
`is_infer` is an attribute. Once an operator is created, its attributes can not be changed. It suggests us that we shall maintain two `batch_norm_op` in the model, one's `is_infer` is `True`(we call it `infer_batch_norm_op`) and the other one's is `False`(we call it `train_batch_norm_op`). They share all parameters and variables, but be placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one go through `train_batch_norm_op` and the other one go through `infer_batch_norm_op`:
Just like what is shown in the above graph, the net forks before `batch_norm_op` and will never merge again. All the operators after `batch_norm_op` will duplicate.
@ -6,17 +6,17 @@ A central problem in machine learning is how to design an algorithm that will pe
### Parameter Norm Penalties
Most common regularization approaches in deep learning are based on limiting the capacity of the models by adding a parameter norm penalty to the objective function `J`. This is given as follows:
The parameter `alpha` is a hyperparameter that weights the relative contribution of the norm penalty term, `omega`, relative to the standard objective function `J`.
The most commonly used norm penalties are the L2 norm penalty and the L1 norm penalty. These are given as follows:
@ -64,9 +64,3 @@ Since we want to create the regularization ops in a lazy manner, the regularizat
#### High-level API
In PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we also need to provide regularization functionality in layer functions. The design of these APIs can be postponed for later right now. A good reference for these APIs can be found in [Keras](https://keras.io/regularizers/) and also by looking at Tensorflow in [`tf.contrib.layers`](https://www.tensorflow.org/api_guides/python/contrib.layers).
@ -10,11 +10,37 @@ Fluid is the answer. Fluid is similar to PyTorch and TensorFlow Eager Execution
Deep learning infrastructure is one of the fastest evolving technologies. Within four years, there have already been three generations of technologies invented.
| Existed since | model as sequence of layers | model as graph of operators | No model |
From the above table, we see that the deep learning technology is evolving towards getting rid of the concept of a model. To understand the reasons behind this direction, a comparison of the *programming paradigms* or the ways to program deep learning applications using these systems, would be helpful. The following section goes over these.
The word *graph* is interchangeable with *block* in this document. A graph consists of computation steps and local variables similar to a C++/Java program block, or a pair of parentheses(`{` and `}`).
@ -121,18 +143,63 @@ Key ingredients about the layers:
- Added to all above layers (except for data and loss layer).
- Sequence-wise normalization for RNNs: BatchNorm only performed on input-state projection and not state-state projection, for efficiency consideration.
<table>
<thead>
<tr>
<th>Required Components</th>
<th> PaddlePaddle Support</th>
<th> Need to Develop</th>
</tr>
</thead>
<tbody>
<tr>
<td>Data Layer I (Spectrogram) </td>
<td>Not supported yet.</td>
<td>TBD (Task 3)</td>
</tr>
<tr>
<td>Data Layer II (Transcription) </td>
<td> paddle.data_type.integer_value_sequence</td>
<td> - </td>
</tr>
<tr>
<td>2D Convolution Layer </td>
<td> paddle.layer.image_conv_layer</td>
<td> - </td>
</tr>
<tr>
<td>DataType Converter (vec2seq)</td>
<td> paddle.layer.block_expand</td>
<td> - </td>
</tr>
<tr>
<td>Bi-/Uni-directional RNNs </td>
<td>paddle.layer.recurrent_group</td>
<td> - </td>
</tr>
<tr>
<td>Row Convolution Layer </td>
<td>Not supported yet.</td>
<td>TBD (Task 4)</td>
</tr>
<tr>
<td>CTC-loss Layer </td>
<td>paddle.layer.warp_ctc</td>
<td> - </td>
</tr>
<tr>
<td>Batch Normalization Layer </td>
<td>paddle.layer.batch_norm</td>
<td> - </td>
</tr>
<tr>
<td>CTC-Beam search </td>
<td>Not supported yet.</td>
<td> TBD (Task 6) </td>
</tr>
</tbody>
</table>
Required Components | PaddlePaddle Support | Need to Develop
@ -7,14 +7,14 @@ It applies several important concepts in machine learning system design, includi
In our GAN design, we wrap it as a user-friendly easily customized python API to design different models. We take the conditional DC-GAN (Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [https://arxiv.org/abs/1511.06434]) as an example due to its good performance on image generation.
Figure 1. The overall running logic of GAN. The black solid arrows indicate the forward pass; the green dashed arrows indicate the backward pass of generator training; the red dashed arrows indicate the backward pass of the discriminator training. The BP pass of the green (red) arrow should only update the parameters in the green (red) boxes. The diamonds indicate the data providers. d\_loss and g\_loss marked in red and green are the two targets we would like to run.
</p>
The operators, layers and functions required/optional to build a GAN demo is summarized in https://github.com/PaddlePaddle/Paddle/issues/4563.