# Batch Normalization
## What is batch normalization
Batch normalization is a frequently-used method in deep network training. It adjusts the mean and variance of a layer's output to make the data distribution easier for the next layer to learn from.
The principle of batch normalization can be summarized into a simple function:
```
y = ((x - E[x]) / STD[x]) * scale + bias
```
`x` is a batch of output data of a certain layer. `E[x]` and `STD[x]` are the mean and standard deviation of `x`, respectively. `scale` and `bias` are two trainable parameters. Training a batch normalization layer amounts to learning the best values of `scale` and `bias`.
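To make the formula concrete, here is a minimal NumPy sketch of the forward transform for a 2-D batch; the function name, the shapes, and the small `epsilon` added for numerical stability are illustrative and not part of the design:
```python
import numpy as np

def batch_norm_forward(x, scale, bias, epsilon=1e-6):
    # x: mini-batch of activations with shape (N, C); statistics are taken over N.
    mean = x.mean(axis=0)                    # E[x]
    std = np.sqrt(x.var(axis=0) + epsilon)   # STD[x]; epsilon avoids division by zero
    x_hat = (x - mean) / std                 # normalized activations
    return x_hat * scale + bias              # scale and shift with the trainable parameters
```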
In our design, we use a single operator (`batch_norm_op`) to implement the whole batch normalization in C++, and wrap it as a layer in Python.
## Differences with normal operators
`batch_norm_op` is a single operator. However, there are a few differences between `batch_norm_op` and normal operators, which we shall take into consideration in our design.
1. `batch_norm_op` shall behave differently in training and inference. For example, during inference there is no batch data, so it is impossible to compute `E[x]` and `STD[x]`; we have to use an `estimated_mean` and an `estimated_variance` instead. This requires the framework to be able to inform operators of the current running mode (training/inference), so that operators can switch their behavior accordingly.
2. `batch_norm_op` shall have the ability to maintain `estimated_mean` and `estimated_variance` across mini-batches. In each mini-batch, `estimated_mean` is updated by the following rule:
```
if batch_id == 0
    estimated_mean = E[x]
else
    estimated_mean = estimated_mean * momentum + (1.0 - momentum) * E[x]
```
The update of `estimated_variance` is similar. `momentum` is an attribute that controls how quickly `estimated_mean` is updated.
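As a concrete illustration, the same update rule can be written as a small plain-Python function (the function and argument names are illustrative; `estimated_variance` is updated in exactly the same way):
```python
def update_estimate(estimated_mean, batch_mean, batch_id, momentum=0.99):
    # First mini-batch: initialize the estimate directly from the batch statistic.
    if batch_id == 0:
        return batch_mean
    # Later mini-batches: exponential moving average controlled by `momentum`.
    return estimated_mean * momentum + (1.0 - momentum) * batch_mean
```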
## Implementation
Batch normalization is designed as a single operator in C++, and then wrapped as a layer in Python.
### C++
As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attributes and compute kernels.
#### Inputs
- `x`: The input data, which is generated by the previous layer.
- `estimated_mean`: The estimated mean of all previous data batches. It is updated in each forward propagation and will be used in inference to take the role of `E[x]`.
- `estimated_var`: The estimated standard deviation of all previous data batches. It is updated in each forward propagation and will be used in inference to take the role of `STD[x]`.
- `scale`: The trainable parameter `scale`.
- `bias`: The trainable parameter `bias`.
#### Outputs
- `y`: The output data.
- `batch_mean`: The mean value of batch data.
- `batch_var`: The standard deviation value of batch data.
- `saved_mean`: Updated `estimated_mean` with current batch data. It's supposed to share the memory with input `estimated_mean`.
- `saved_var`: Updated `estimated_var` with current batch data. It's supposed to share the memory with input `estimated_var`.
#### Attributes
- `is_infer`: *bool*. If true, run `batch_norm_op` in inference mode.
- `use_global_est`: *bool*. If true, use `saved_mean` and `saved_var` instead of `E[x]` and `STD[x]` in training.
- `epsilon`: *float*. The epsilon value to avoid division by zero.
- `momentum`: *float*. Factor used in `estimated_mean` and `estimated_var` updating. The usage is shown above.
#### Kernels
The following graph shows the training computational process of `batch_norm_op`:
cuDNN provides APIs that cover this whole series of computations, so we can use them in our GPU kernel.
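To make the data flow concrete, here is a NumPy sketch of what the training-mode computation produces, using the input/output/attribute names defined above (following this document's convention, `batch_var` and `saved_var` hold standard deviations; the `batch_id == 0` initialization shown earlier is omitted for brevity). This is an illustration, not the actual kernel:
```python
import numpy as np

def batch_norm_train_forward(x, estimated_mean, estimated_var, scale, bias,
                             epsilon=1e-6, momentum=0.99):
    # Statistics of the current mini-batch (the `batch_mean` / `batch_var` outputs).
    batch_mean = x.mean(axis=0)
    batch_var = np.sqrt(x.var(axis=0) + epsilon)
    # Normalize, then scale and shift (the `y` output).
    y = (x - batch_mean) / batch_var * scale + bias
    # Update the running estimates (the `saved_mean` / `saved_var` outputs,
    # which share memory with `estimated_mean` / `estimated_var` in the real operator).
    saved_mean = estimated_mean * momentum + (1.0 - momentum) * batch_mean
    saved_var = estimated_var * momentum + (1.0 - momentum) * batch_var
    return y, batch_mean, batch_var, saved_mean, saved_var
```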
### Python
`batch_norm_op` is wrapped as a layer in Python:
```python
def batch_norm_layer(net,
                     input,
                     output,
                     scale,
                     bias,
                     use_global_est=False,
                     epsilon=1e-6,
                     momentum=0.99):
    # Non-trainable variables that cache the running statistics across batches.
    mean_cache = scope.new_var(name='estimated_mean', trainable=False)
    var_cache = scope.new_var(name='estimated_var', trainable=False)
    batch_mean = scope.new_var(name='batch_mean')
    batch_var = scope.new_var(name='batch_var')
    batch_norm_op = Operator('batch_norm_op',
                             x=input,
                             estimated_mean=mean_cache,
                             estimated_var=var_cache,
                             scale=scale,
                             bias=bias,
                             y=output,
                             batch_mean=batch_mean,
                             batch_var=batch_var,
                             saved_mean=mean_cache,
                             saved_var=var_cache,
                             is_infer=False,
                             use_global_est=use_global_est,
                             epsilon=epsilon,
                             momentum=momentum)
    net.append_op(batch_norm_op)
    return output
```
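For illustration only, a layer defined this way might be wired into a network roughly as follows; `net`, `scope`, `fc_output`, and the variable names here are hypothetical:
```python
# Hypothetical usage: insert batch normalization after a previous layer's output.
scale = scope.new_var(name='bn_scale', trainable=True)
bias = scope.new_var(name='bn_bias', trainable=True)
bn_output = scope.new_var(name='bn_output')
batch_norm_layer(net,
                 input=fc_output,   # output variable of the previous layer
                 output=bn_output,
                 scale=scale,
                 bias=bias)
```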
Because the Python API has not been finalized, the code above should be regarded as pseudo code. There are a few key points we shall note:
1. `estimated_mean` and `estimated_var` are assigned the same variables as `saved_mean` and `saved_var` respectively, so they share the same memory. The output mean and variance values (`saved_mean` and `saved_var`) of a certain batch become the inputs (`estimated_mean` and `estimated_var`) of the next batch.
2. `is_infer` decides whether `batch_norm_op` runs in training mode or inference mode. However, a network may contain both training and inference parts, and the user may switch `batch_norm_op`'s running mode in a Python `for` loop like this:
```python
for pass_id in range(PASS_NUM):
    # ...
    net.train()                  # run in training mode
    if pass_id % 100 == 0:
        net.infer(test_image)    # run in inference mode
    # ...
```
`is_infer` is an attribute. Once an operator is created, its attributes cannot be changed. This suggests that we shall maintain two `batch_norm_op`s in the model, one whose `is_infer` is `True` (we call it `infer_batch_norm_op`) and the other whose `is_infer` is `False` (we call it `train_batch_norm_op`). They share all parameters and variables, but are placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches: one goes through `train_batch_norm_op` and the other goes through `infer_batch_norm_op`.
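A rough sketch of this two-branch construction, reusing the pseudo API above (`train_output`, `infer_output`, `train_net`, and `infer_net` are hypothetical names), might look like this:
```python
# Hypothetical sketch: both operators bind the same parameters and caches and
# differ only in the `is_infer` attribute and the branch they are appended to.
shared = dict(x=input, estimated_mean=mean_cache, estimated_var=var_cache,
              scale=scale, bias=bias,
              batch_mean=batch_mean, batch_var=batch_var,
              saved_mean=mean_cache, saved_var=var_cache,
              use_global_est=use_global_est, epsilon=epsilon, momentum=momentum)
train_batch_norm_op = Operator('batch_norm_op', y=train_output, is_infer=False, **shared)
infer_batch_norm_op = Operator('batch_norm_op', y=infer_output, is_infer=True, **shared)
train_net.append_op(train_batch_norm_op)   # branch executed by net.train()
infer_net.append_op(infer_batch_norm_op)   # branch executed by net.infer()
```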