@@ -42,51 +42,51 @@ class RMSProp(Optimizer):
"""
Implements Root Mean Squared Propagation (RMSProp) algorithm.

Note:
    When parameter groups are separated, the weight decay set in each group is applied to that group's
    parameters if the weight decay is positive. When parameter groups are not separated, the `weight_decay`
    given to the API is applied to the parameters whose names do not contain 'beta' or 'gamma', provided
    `weight_decay` is positive.

    To improve the performance of parameter groups, a customized order of parameters is supported.
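
For example, the 'beta'/'gamma' rule in the Note can be illustrated with a short, runnable sketch.
This is an illustrative sketch only; the parameter names below are hypothetical and not part of this API:

.. code-block:: python

    # Hypothetical parameter names; weight decay skips 'beta'/'gamma' entries.
    param_names = ['conv1.weight', 'bn1.gamma', 'bn1.beta', 'fc.weight']
    decayed = [n for n in param_names if 'beta' not in n and 'gamma' not in n]
    no_decay = [n for n in param_names if 'beta' in n or 'gamma' in n]
    print(decayed)   # ['conv1.weight', 'fc.weight'] -- weight_decay applied
    print(no_decay)  # ['bn1.gamma', 'bn1.beta']     -- weight_decay skipped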

Update `params` according to the RMSProp algorithm.

The equation is as follows:

.. math::
    s_{t} = \\rho s_{t-1} + (1 - \\rho)(\\nabla Q_{i}(w))^2

.. math::
    m_{t} = \\beta m_{t-1} + \\frac{\\eta}{\\sqrt{s_{t} + \\epsilon}} \\nabla Q_{i}(w)

.. math::
    w = w - m_{t}

The first equation calculates a moving average of the squared gradient for each weight;
the gradient is then divided by :math:`\\sqrt{s_{t} + \\epsilon}`.
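
Concretely, the non-centered update can be sketched in a few lines of NumPy. This is an illustrative
sketch of the three equations above, not the operator's actual implementation; the state names
`square_avg` and `moment` and the default hyperparameter values are assumptions:

.. code-block:: python

    import numpy as np

    def rmsprop_step(w, grad, square_avg, moment,
                     learning_rate=0.1, decay=0.9, momentum=0.0, epsilon=1e-10):
        # s_t = rho * s_{t-1} + (1 - rho) * (grad)^2
        square_avg = decay * square_avg + (1.0 - decay) * grad ** 2
        # m_t = beta * m_{t-1} + eta / sqrt(s_t + eps) * grad
        moment = momentum * moment + learning_rate * grad / np.sqrt(square_avg + epsilon)
        # w = w - m_t
        return w - moment, square_avg, moment

    # One step on a toy weight vector; optimizer state starts at zero.
    w, s, m = np.ones(3), np.zeros(3), np.zeros(3)
    w, s, m = rmsprop_step(w, np.array([0.5, -0.1, 0.2]), s, m)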

If `centered` is True:

.. math::
    g_{t} = \\rho g_{t-1} + (1 - \\rho)\\nabla Q_{i}(w)

.. math::
    s_{t} = \\rho s_{t-1} + (1 - \\rho)(\\nabla Q_{i}(w))^2

.. math::
    m_{t} = \\beta m_{t-1} + \\frac{\\eta}{\\sqrt{s_{t} - g_{t}^2 + \\epsilon}} \\nabla Q_{i}(w)

.. math::
    w = w - m_{t}
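
The centered variant can be sketched the same way, under the same naming assumptions as the
non-centered sketch above:

.. code-block:: python

    import numpy as np

    def centered_rmsprop_step(w, grad, mean_grad, square_avg, moment,
                              learning_rate=0.1, decay=0.9, momentum=0.0, epsilon=1e-10):
        # g_t = rho * g_{t-1} + (1 - rho) * grad
        mean_grad = decay * mean_grad + (1.0 - decay) * grad
        # s_t = rho * s_{t-1} + (1 - rho) * (grad)^2
        square_avg = decay * square_avg + (1.0 - decay) * grad ** 2
        # m_t = beta * m_{t-1} + eta / sqrt(s_t - g_t^2 + eps) * grad
        denom = np.sqrt(square_avg - mean_grad ** 2 + epsilon)
        moment = momentum * moment + learning_rate * grad / denom
        # w = w - m_t
        return w - moment, mean_grad, square_avg, moment

Subtracting :math:`g_{t}^2` turns :math:`s_{t}` into an estimate of the gradient's variance rather
than its raw second moment, which makes the step size less sensitive to a large mean gradient.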

where :math:`w` represents `params`, the parameters to be updated.
:math:`g_{t}` is the mean of the gradients and :math:`g_{t-1}` is its value at the previous step.
:math:`s_{t}` is the mean of the squared gradients and :math:`s_{t-1}` is its value at the previous step.
:math:`m_{t}` is the moment, i.e. the delta applied to `w`, and :math:`m_{t-1}` is its value at the previous step.
:math:`\\rho` represents `decay`. :math:`\\beta` is the momentum term and represents `momentum`.
:math:`\\epsilon` is a smoothing term to avoid division by zero and represents `epsilon`.
:math:`\\eta` is the learning rate and represents `learning_rate`. :math:`\\nabla Q_{i}(w)` is the gradient
and represents `gradients`.

Args:
    params (Union[list[Parameter], list[dict]]): When `params` is a list of `Parameter` which will be updated,