diff --git a/paddle/operators/batch_norm_op.md b/paddle/operators/batch_norm_op.md new file mode 100644 index 0000000000..80948adf2b --- /dev/null +++ b/paddle/operators/batch_norm_op.md @@ -0,0 +1,134 @@ +# Batch Normalization + +## What is batch normalization + +Batch normalization is a frequently-used method in deep network training. It adjusts the mean and variance of a layer's output, and make the data distribution easier for next layer's training. + +The principle of batch normalization can be summarized into a simple function: + +``` +y = (x - E[x]) / STD[x]) * scale + bias +``` + +`x` is a batch of output data of a certain layer. `E[x]` and `STD[x]` is the mean and standard deviation of `x`, respectively。 `scale` and `bias` are two trainable parameters. The training of batch normalization layer equals to the learning of best values of `scale` and `bias`. + +In our design, we use a single operator(`batch_norm_op`) to implement the whole batch normalization in C++, and wrap it as a layer in Python. + +## Differences with normal operators + +`batch_norm_op` is a single operator. However, there are a few differences between `BatchNormOp` and normal operators, which we shall take into consideration in our design. + +1. `batch_norm_op` shall behave differently in training and inferencing. For example, during inferencing, there is no batch data and it's impossible to compute `E[x]` and `STD[x]`, so we have to use an `estimated_mean` and an `estimated_variance` instead of them. These require our framework to be able to inform operators current running type (training/inferencing), then operators can switch their behaviors. + +2. `batch_norm_op` shall have the ability to maintain `estimated_mean` and `estimated_variance` across mini-batch. In each mini-batch, `estimated_mean` is iterated by the following equations: + +``` +if batch_id == 0 + estimated_mean = E[x] +else + estimated_mean = estimated_mean * momentum + (1.0 - momentum_) * E[x] +``` + +The iterating of `estimated_variance` is similar. `momentum` is an attribute, which controls estimated_mean updating speed. + +## Implementation + +Batch normalization is designed as a single operator is C++, and then wrapped as a layer in Python. + +### C++ + +As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attributes and compute kernels. + +#### Inputs + +- `x`: The inputs data, which is generated by the previous layer. +- `estimated_mean`: The estimated mean of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `E[x]`. +- `estimated_var`: The estimated standard deviation of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `STD[x]`. +- `scale`: trainable parameter 'scale' +- `bias`: trainable parameter 'bias' + +#### Outputs + +- `y`: The output data. +- `batch_mean`: The mean value of batch data. +- `batch_var`: The standard deviation value of batch data. +- `saved_mean`: Updated `estimated_mean` with current batch data. It's supposed to share the memory with input `estimated_mean`. +- `saved_var`: Updated `estimated_var` with current batch data. It's supposed to share the memory with input `estimated_var`. + +#### Attributes + +- `is_infer`: *bool*. If true, run `batch_norm_op` in inferencing mode. +- `use_global_est`: *bool*. If true, use `saved_mean` and `saved_var` instead of `E[x]` and `STD[x]` in trainning. +- `epsilon`: *float*. The epsilon value to avoid division by zero. +- `momentum`: *float*. Factor used in `estimated_mean` and `estimated_var` updating. The usage is shown above. + +#### Kernels + +The following graph showes the training computational process of `batch_norm_op`: + + + +cudnn provides APIs to finish the whole series of computation, we can use them in our GPU kernel. + +### Python + +`batch_norm_op` is warpped as a layer in Python: + +```python +def batch_norm_layer(net, + input, + output, + scale, + bias, + use_global_est = False, + epsilon = 1e-6, + momentum = 0.99): + mean_cache = scope.new_var(name = 'estimated_mean', trainable = False) + var_cache = scop.new_var(name = 'estimated_var', trainable = False) + batch_mean = scope.new_var(name = 'batch_mean') + batch_var = scope.new_var(name = 'batch_var') + batch_norm_op = Operator('batch_norm_op', + x = input, + estimated_mean = mean_cache, + estimated_mean = var_cache, + scale = scale, + bias = bias, + y = output, + batch_mean = batch_mean, + batch_var = batch_var, + saved_mean = mean_cache, + saved_var = var_cache, + is_infer = False, + use_global_est = use_global_est, + epsilon = epsilon, + momentum = momentum) + net.append_op(batch_norm_op) + return output +``` + +Because Python API has not been finally decided, the code above can be regarded as pseudo code. There are a few key points we shall note: + +1. `estimated_mean` and `estimated_var` are assigned the same variables with `saved_mean` and `saved_var` respectively. So they share same the memories. The output mean and variance values(`saved_mean` and `saved_var`) of a certain batch will be the inputs(`estimated_mean` and `estimated_var`) of the next batch. + +2. `is_infer` decided whether `batch_norm_op` will run in training mode or inferencing mode. However, a network may contains both training and inferencing parts. And user may switch `batch_norm_op`'s running mode in Python `for` loop like this: + +```python +for pass_id in range(PASS_NUM): + # ... + net.train() # run training model + if pass_id % 100 == 0: + net.infer(test_image) # run inferencing model + # ... +``` + +`is_infer` is an attribute. Once an operator is created, its attributes can not be changed. It suggests us that we shall maintain two `batch_norm_op` in the model, one's `is_infer` is `True`(we call it `infer_batch_norm_op`) and the other one's is `False`(we call it `train_batch_norm_op`). They share all parameters and variables, but be placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one go through `train_batch_norm_op` and the other one go through `infer_batch_norm_op`: + +