!4388 Third round of enhancement of API comment & README_CN

Merge pull request !4388 from Simson/enhancement-API
5 years ago · 15496ff5a4
parent ac4532e664 7cc48a9af8
commit 15496ff5a4
42 changed files with 518 additions and 291 deletions
--- a/README.md
+++ b/README.md
@ -1,7 +1,9 @@
 ![MindSpore Logo](docs/MindSpore-logo.png "MindSpore logo")
 ============================================================

- [What Is MindSpore?](#what-is-mindspore)
+[查看中文](./README_CN.md)
+
+- [What Is MindSpore](#what-is-mindspore)
    - [Automatic Differentiation](#automatic-differentiation)
    - [Automatic Parallel](#automatic-parallel)
 - [Installation](#installation)
--- a/README_CN.md
+++ b/README_CN.md
@ -0,0 +1,220 @@
+![MindSpore标志](docs/MindSpore-logo.png "MindSpore logo")
+============================================================
+
+[View English](./README.md)
+
+- [MindSpore介绍](#mindspore介绍)
+    - [自动微分](#自动微分)
+    - [自动并行](#自动并行)
+- [安装](#安装)
+    - [二进制文件](#二进制文件)
+    - [来源](#来源)
+    - [Docker镜像](#docker镜像)
+- [快速入门](#快速入门)
+- [文档](#文档)
+- [社区](#社区)
+    - [治理](#治理)
+    - [交流](#交流)
+- [贡献](#贡献)
+- [版本说明](#版本说明)
+- [许可证](#许可证)
+
+## MindSpore介绍
+
+MindSpore是一种适用于端边云场景的新型开源深度学习训练/推理框架。
+MindSpore提供了友好的设计和高效的执行，旨在提升数据科学家和算法工程师的开发体验，并为Ascend AI处理器提供原生支持，以及软硬件协同优化。
+
+
+同时，MindSpore作为全球AI开源社区，致力于进一步开发和丰富AI软硬件应用生态。
+
+
+
+<img src="docs/MindSpore-architecture.png" alt="MindSpore Architecture" width="600"/>
+
+欲了解更多详情，请查看我们的[总体架构](https://www.mindspore.cn/docs/zh-CN/master/architecture.html)。
+
+### 自动微分
+
+当前主流深度学习框架中有三种自动微分技术：
+
+- **基于静态计算图的转换**：编译时将网络转换为静态数据流图，将链式法则应用于数据流图，实现自动微分。
+- **基于动态计算图的转换**：记录算子过载正向执行时网络的运行轨迹，对动态生成的数据流图应用链式法则，实现自动微分。
+- **基于源码的转换**：该技术是从功能编程框架演进而来，以即时编译（Just-in-time Compilation，JIT）的形式对中间表达式（程序在编译过程中的表达式）进行自动差分转换，支持复杂的控制流场景、高阶函数和闭包。
+
+TensorFlow早期采用的是静态计算图，PyTorch采用的是动态计算图。静态映射可以利用静态编译技术来优化网络性能，但是构建网络或调试网络非常复杂。动态图的使用非常方便，但很难实现性能的极限优化。
+
+MindSpore找到了另一种方法，即基于源代码转换的自动微分。一方面，它支持自动控制流的自动微分，因此像PyTorch这样的模型构建非常方便。另一方面，MindSpore可以对神经网络进行静态编译优化，以获得更好的性能。
+
+<img src="docs/Automatic-differentiation.png" alt="Automatic Differentiation" width="600"/>
+
+MindSpore自动微分的实现可以理解为程序本身的符号微分。MindSpore IR是一个函数中间表达式，它与基础代数中的复合函数具有直观的对应关系。复合函数的公式由任意可推导的基础函数组成。MindSpore IR中的每个原语操作都可以对应基础代数中的基本功能，从而可以建立更复杂的流控制。
+
+### 自动并行
+
+MindSpore自动并行的目的是构建数据并行、模型并行和混合并行相结合的训练方法。该方法能够自动选择开销最小的模型切分策略，实现自动分布并行训练。
+
+<img src="docs/Automatic-parallel.png" alt="Automatic Parallel" width="600"/>
+
+目前MindSpore采用的是算子切分的细粒度并行策略，即图中的每个算子被切分为一个集群，完成并行操作。在此期间的切分策略可能非常复杂，但是作为一名Python开发者，您无需关注底层实现，只要顶层API计算是有效的即可。
+
+## 安装
+
+### 二进制文件
+
+MindSpore提供跨多个后端的构建选项：
+
+| 硬件平台          | 操作系统            | 状态   |
+| :------------ | :-------------- | :--- |
+| Ascend 910    | Ubuntu-x86      | ✔️   |
+|               | EulerOS-x86     | ✔️   |
+|               | EulerOS-aarch64 | ✔️   |
+| GPU CUDA 10.1 | Ubuntu-x86      | ✔️   |
+| CPU           | Ubuntu-x86      | ✔️   |
+|               | Windows-x86     | ✔️   |
+
+使用`pip`命令安装，以`CPU`和`Ubuntu-x86`build版本为例：
+
+1. 请从[MindSpore下载页面](https://www.mindspore.cn/versions)下载并安装whl包。
+
+    ```
+    pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/0.6.0-beta/MindSpore/cpu/ubuntu_x86/mindspore-0.6.0-cp37-cp37m-linux_x86_64.whl
+    ```
+
+2. 执行以下命令，验证安装结果。
+
+    ```python
+    import numpy as np
+    import mindspore.context as context
+    import mindspore.nn as nn
+    from mindspore import Tensor
+    from mindspore.ops import operations as P
+    
+    context.set_context(mode=context.GRAPH_MODE, device_target="CPU")
+    
+    class Mul(nn.Cell):
+        def __init__(self):
+            super(Mul, self).__init__()
+            self.mul = P.Mul()
+    
+        def construct(self, x, y):
+            return self.mul(x, y)
+    
+    x = Tensor(np.array([1.0, 2.0, 3.0]).astype(np.float32))
+    y = Tensor(np.array([4.0, 5.0, 6.0]).astype(np.float32))
+    
+    mul = Mul()
+    print(mul(x, y))
+    ```
+    ```
+    [ 4. 10. 18.]
+    ```
+### 来源
+
+[MindSpore安装](https://www.mindspore.cn/install)。
+
+### Docker镜像
+
+MindSpore的Docker镜像托管在[Docker Hub](https://hub.docker.com/r/mindspore)上。
+目前容器化构建选项支持情况如下：
+
+| 硬件平台   | Docker镜像仓库                | 标签                       | 说明                                       |
+| :----- | :------------------------ | :----------------------- | :--------------------------------------- |
+| CPU    | `mindspore/mindspore-cpu` | `x.y.z`                  | 已经预安装MindSpore `x.y.z` CPU版本的生产环境。       |
+|        |                           | `devel`                  | 提供开发环境从源头构建MindSpore（`CPU`后端）。安装详情请参考https://www.mindspore.cn/install。 |
+|        |                           | `runtime`                | 提供运行时环境安装MindSpore二进制包（`CPU`后端）。         |
+| GPU    | `mindspore/mindspore-gpu` | `x.y.z`                  | 已经预安装MindSpore `x.y.z` GPU版本的生产环境。       |
+|        |                           | `devel`                  | 提供开发环境从源头构建MindSpore（`GPU CUDA10.1`后端）。安装详情请参考https://www.mindspore.cn/install。 |
+|        |                           | `runtime`                | 提供运行时环境安装MindSpore二进制包（`GPU CUDA10.1`后端）。 |
+| Ascend | <center>&mdash;</center>  | <center>&mdash;</center> | 即将推出，敬请期待。                               |
+
+> **注意：** 不建议从源头构建GPU `devel` Docker镜像后直接安装whl包。我们强烈建议您在GPU `runtime` Docker镜像中传输并安装whl包。
+
+* CPU
+
+    对于`CPU`后端，可以直接使用以下命令获取并运行最新的稳定镜像：
+    ```
+    docker pull mindspore/mindspore-cpu:0.6.0-beta
+    docker run -it mindspore/mindspore-cpu:0.6.0-beta /bin/bash
+    ```
+
+* GPU
+
+    对于`GPU`后端，请确保`nvidia-container-toolkit`已经提前安装，以下是`Ubuntu`用户安装指南：
+    ```
+    DISTRIBUTION=$(. /etc/os-release; echo $ID$VERSION_ID)
+    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
+    curl -s -L https://nvidia.github.io/nvidia-docker/$DISTRIBUTION/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
+
+    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit nvidia-docker2
+    sudo systemctl restart docker
+    ```
+
+    使用以下命令获取并运行最新的稳定镜像：
+    ```
+    docker pull mindspore/mindspore-gpu:0.6.0-beta
+    docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:0.6.0-beta /bin/bash
+    ```
+
+    要测试Docker是否正常工作，请运行下面的Python代码并检查输出：
+    ```python
+    import numpy as np
+    import mindspore.context as context
+    from mindspore import Tensor
+    from mindspore.ops import functional as F
+
+    context.set_context(device_target="GPU")
+
+    x = Tensor(np.ones([1,3,3,4]).astype(np.float32))
+    y = Tensor(np.ones([1,3,3,4]).astype(np.float32))
+    print(F.tensor_add(x, y))
+    ```
+    ```
+    [[[ 2.  2.  2.  2.],
+    [ 2.  2.  2.  2.],
+    [ 2.  2.  2.  2.]],
+
+    [[ 2.  2.  2.  2.],
+    [ 2.  2.  2.  2.],
+    [ 2.  2.  2.  2.]],
+
+    [[ 2.  2.  2.  2.],
+    [ 2.  2.  2.  2.],
+    [ 2.  2.  2.  2.]]]
+    ```
+
+如果您想了解更多关于MindSpore Docker镜像的构建过程，请查看[docker](docker/README.md) repo了解详细信息。
+
+## 快速入门
+
+参考[快速入门](https://www.mindspore.cn/tutorial/zh-CN/master/quick_start/quick_start.html)实现图片分类。
+
+
+## 文档
+
+有关安装指南、教程和API的更多详细信息，请参阅[用户文档](https://gitee.com/mindspore/docs)。
+
+## 社区
+
+### 治理
+
+查看MindSpore如何进行[开放治理](https://gitee.com/mindspore/community/blob/master/governance.md)。
+
+### 交流
+
+- [MindSpore Slack](https://join.slack.com/t/mindspore/shared_invite/zt-dgk65rli-3ex4xvS4wHX7UDmsQmfu8w) 开发者交流平台。
+- `#mindspore`IRC频道（仅用于会议记录）
+- 视频会议：待定
+- 邮件列表：<https://mailweb.mindspore.cn/postorius/lists>
+
+## 贡献
+
+欢迎参与贡献。更多详情，请参阅我们的[贡献者Wiki](CONTRIBUTING.md)。
+
+
+## 版本说明
+
+版本说明请参阅[RELEASE](RELEASE.md)。
+
+## 许可证
+
+[Apache License 2.0](LICENSE)
--- a/mindspore/ccsrc/pybind_api/ir/tensor_py.cc
+++ b/mindspore/ccsrc/pybind_api/ir/tensor_py.cc
@ -150,7 +150,7 @@ TensorPtr TensorPy::MakeTensor(const py::array &input, const TypePtr &type_ptr)
  // Get tensor shape.
  std::vector<int> shape(buf.shape.begin(), buf.shape.end());
  if (data_type == buf_type) {
-    // Use memory copy if input data type is same as the required type.
+    // Use memory copy if input data type is the same as the required type.
    return std::make_shared<Tensor>(data_type, shape, buf.ptr, buf.size * buf.itemsize);
  }
  // Create tensor with data type converted.
--- a/mindspore/context.py
+++ b/mindspore/context.py
@ -546,9 +546,11 @@ def set_context(**kwargs):

    Note:
        Attribute name is required for setting attributes.
+        Changing the mode after the net is initialized is not recommended, because some operations are implemented
+        differently in graph mode and pynative mode. Default mode: PYNATIVE_MODE.

    Args:
-        mode (int): Running in GRAPH_MODE(0) or PYNATIVE_MODE(1). Default: PYNATIVE_MODE.
+        mode (int): Running in GRAPH_MODE(0) or PYNATIVE_MODE(1).
        device_target (str): The target device to run, support "Ascend", "GPU", "CPU". Default: "Ascend".
        device_id (int): Id of target device, the value must be in [0, device_num_per_host-1],
                    while device_num_per_host should no more than 4096. Default: 0.
--- a/mindspore/nn/cell.py
+++ b/mindspore/nn/cell.py
@ -148,7 +148,7 @@ class Cell:

    def update_cell_type(self, cell_type):
        """
-        Update the current cell type mainly identify if quantization aware training network.
+        The current cell type is updated when a quantization aware training network is encountered.

        After being invoked, it can set the cell type to 'cell_type'.
        """
@ -936,7 +936,7 @@ class GraphKernel(Cell):
    Base class for GraphKernel.

    A `GraphKernel` a composite of basic primitives and can be compiled into a fused kernel automatically when
-    context.set_context(enable_graph_kernel=True).
+    enable_graph_kernel in context is set to True.

    Examples:
        >>> class Relu(GraphKernel):
--- a/mindspore/nn/graph_kernels/graph_kernels.py
+++ b/mindspore/nn/graph_kernels/graph_kernels.py
@ -661,7 +661,7 @@ class LogSoftmax(GraphKernel):
    Log Softmax activation function.

    Applies the Log Softmax function to the input tensor on the specified axis.
-    Suppose a slice along the given aixs :math:`x` then for each element :math:`x_i`
+    Suppose a slice in the given axis :math:`x`, then for each element :math:`x_i`
    the Log Softmax function is shown as follows:

    .. math::
@ -987,10 +987,10 @@ class LayerNorm(Cell):
    Applies Layer Normalization over a mini-batch of inputs.

    Layer normalization is widely used in recurrent neural networks. It applies
-    normalization over a mini-batch of inputs for each single training case as described
+    normalization on a mini-batch of inputs for each single training case as described
    in the paper `Layer Normalization <https://arxiv.org/pdf/1607.06450.pdf>`_. Unlike batch
    normalization, layer normalization performs exactly the same computation at training and
-    testing times. It can be described using the following formula. It is applied across all channels
+    testing time. It can be described using the following formula. It is applied across all channels
    and pixel but only one batch size.

    .. math::
@ -1139,9 +1139,9 @@ class LambNextMV(GraphKernel):
    Outputs:
        Tuple of 2 Tensor.

-        - **add3** (Tensor) - The shape is same as the shape after broadcasting, and the data type is
+        - **add3** (Tensor) - The shape is the same as the shape after broadcasting, and the data type is
                              the one with high precision or high digits among the inputs.
-        - **realdiv4** (Tensor) - The shape is same as the shape after broadcasting, and the data type is
+        - **realdiv4** (Tensor) - The shape is the same as the shape after broadcasting, and the data type is
                                  the one with high precision or high digits among the inputs.

    Examples:
--- a/mindspore/nn/layer/activation.py
+++ b/mindspore/nn/layer/activation.py
@ -55,7 +55,7 @@ class Softmax(Cell):
    .. math::
        \text{softmax}(x_{i}) =  \frac{\exp(x_i)}{\sum_{j=0}^{n-1}\exp(x_j)},

-    where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
+    where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.

    Args:
        axis (Union[int, tuple[int]]): The axis to apply Softmax operation, -1 means the last dimension. Default: -1.
@ -87,11 +87,11 @@ class LogSoftmax(Cell):

    Applies the LogSoftmax function to n-dimensional input tensor.

-    The input is transformed with Softmax function and then with log function to lie in range[-inf,0).
+    The input is transformed by the Softmax function and then by the log function to lie in range [-inf, 0).

    Logsoftmax is defined as:
    :math:`\text{logsoftmax}(x_i) = \log \left(\frac{\exp(x_i)}{\sum_{j=0}^{n-1} \exp(x_j)}\right)`,
-    where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
+    where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.

    Args:
        axis (int): The axis to apply LogSoftmax operation, -1 means the last dimension. Default: -1.
@ -123,7 +123,7 @@ class ELU(Cell):
    Exponential Linear Uint activation function.

    Applies the exponential linear unit function element-wise.
-    The activation function defined as:
+    The activation function is defined as:

    .. math::
        E_{i} =
@ -162,7 +162,7 @@ class ReLU(Cell):

    Applies the rectified linear unit function element-wise. It returns
    element-wise :math:`\max(0, x)`, specially, the neurons with the negative output
-    will suppressed and the active neurons will stay the same.
+    will be suppressed and the active neurons will stay the same.

    Inputs:
        - **input_data** (Tensor) - The input of ReLU.
@ -197,7 +197,7 @@ class ReLU6(Cell):
        - **input_data** (Tensor) - The input of ReLU6.

    Outputs:
-        Tensor, which has the same type with `input_data`.
+        Tensor, which has the same type as `input_data`.

    Examples:
        >>> input_x = Tensor(np.array([-1, -2, 0, 2, 1]), mindspore.float16)
@ -234,7 +234,7 @@ class LeakyReLU(Cell):
        - **input_x** (Tensor) - The input of LeakyReLU.

    Outputs:
-        Tensor, has the same type and shape with the `input_x`.
+        Tensor, has the same type and shape as the `input_x`.

    Examples:
        >>> input_x = Tensor(np.array([[-1.0, 4.0, -8.0], [2.0, -5.0, 9.0]]), mindspore.float32)
@ -365,7 +365,7 @@ class PReLU(Cell):
    PReLU is defined as: :math:`prelu(x_i)= \max(0, x_i) + w * \min(0, x_i)`, where :math:`x_i`
    is an element of an channel of the input.

-    Here :math:`w` is an learnable parameter with default initial value 0.25.
+    Here :math:`w` is a learnable parameter with a default initial value of 0.25.
    Parameter :math:`w` has dimensionality of the argument channel. If called without argument
    channel, a single parameter :math:`w` will be shared across all channels.

@ -413,7 +413,7 @@ class PReLU(Cell):

 class HSwish(Cell):
    r"""
-    rHard swish activation function.
+    Hard swish activation function.

    Applies hswish-type activation element-wise. The input is a Tensor with any valid shape.

@ -422,7 +422,7 @@ class HSwish(Cell):
    .. math::
        \text{hswish}(x_{i}) = x_{i} * \frac{ReLU6(x_{i} + 3)}{6},

-    where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
+    where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.

    Inputs:
        - **input_data** (Tensor) - The input of HSwish.
@ -456,7 +456,7 @@ class HSigmoid(Cell):
    .. math::
        \text{hsigmoid}(x_{i}) = max(0, min(1, \frac{x_{i} + 3}{6})),

-    where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
+    where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.

    Inputs:
        - **input_data** (Tensor) - The input of HSigmoid.
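
Reviewer note: the formulas in the `activation.py` docstrings above can be sanity-checked without MindSpore. The sketch below is plain Python on lists of floats, not the library's implementation. The function names are illustrative and only mirror the docstring formulas for Softmax, LogSoftmax, HSwish and HSigmoid.

```python
import math


def softmax(xs):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j) over one slice."""
    m = max(xs)  # assumption: shift by the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    """logsoftmax(x_i) = log(exp(x_i) / sum_j exp(x_j)) over one slice."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


def relu6(x):
    """Clamp x to the interval [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hswish(x):
    """hswish(x) = x * ReLU6(x + 3) / 6, applied to a single element."""
    return x * relu6(x + 3.0) / 6.0


def hsigmoid(x):
    """hsigmoid(x) = max(0, min(1, (x + 3) / 6)), applied to a single element."""
    return max(0.0, min(1.0, (x + 3.0) / 6.0))


probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))      # close to 1.0
print(hsigmoid(0.0))   # 0.5
```

Subtracting the maximum before `exp` is a common stability choice and an assumption here. The docstrings do not say how the operators handle overflow.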
--- a/mindspore/nn/layer/basic.py
+++ b/mindspore/nn/layer/basic.py
@ -65,7 +65,7 @@ class Dropout(Cell):
        dtype (:class:`mindspore.dtype`): Data type of input. Default: mindspore.float32.

    Raises:
-        ValueError: If keep_prob is not in range (0, 1).
+        ValueError: If `keep_prob` is not in range (0, 1).

    Inputs:
        - **input** (Tensor) - An N-D Tensor.
@ -373,8 +373,8 @@ class OneHot(Cell):
        axis is created at dimension `axis`.

    Args:
-        axis (int): Features x depth if axis == -1, depth x features
-                    if axis == 0. Default: -1.
+        axis (int): Features x depth if axis is -1, depth x features
+                    if axis is 0. Default: -1.
        depth (int): A scalar defining the depth of the one hot dimension. Default: 1.
        on_value (float): A scalar defining the value to fill in output[i][j]
                          when indices[j] = i. Default: 1.0.
@ -492,18 +492,18 @@ class Unfold(Cell):
    The input tensor must be a 4-D tensor and the data format is NCHW.

    Args:
-        ksizes (Union[tuple[int], list[int]]): The size of sliding window, should be a tuple or list of int,
+        ksizes (Union[tuple[int], list[int]]): The size of sliding window, should be a tuple or a list of integers,
            and the format is [1, ksize_row, ksize_col, 1].
        strides (Union[tuple[int], list[int]]): Distance between the centers of the two consecutive patches,
            should be a tuple or list of int, and the format is [1, stride_row, stride_col, 1].
-        rates (Union[tuple[int], list[int]]): In each extracted patch, the gap between the corresponding dim
-            pixel positions, should be a tuple or list of int, and the format is [1, rate_row, rate_col, 1].
+        rates (Union[tuple[int], list[int]]): In each extracted patch, the gap between the corresponding dimension
+            pixel positions, should be a tuple or a list of integers, and the format is [1, rate_row, rate_col, 1].
        padding (str): The type of padding algorithm, is a string whose value is "same" or "valid",
            not case sensitive. Default: "valid".

            - same: Means that the patch can take the part beyond the original image, and this part is filled with 0.

-            - valid: Means that the patch area taken must be completely contained in the original image.
+            - valid: Means that the extracted patch area must be completely contained within the original image.

    Inputs:
        - **input_x** (Tensor) - A 4-D tensor whose shape is [in_batch, in_depth, in_row, in_col] and
@ -511,7 +511,7 @@ class Unfold(Cell):

    Outputs:
        Tensor, a 4-D tensor whose data type is same as 'input_x',
-        and the shape is [out_batch, out_depth, out_row, out_col], the out_batch is same as the in_batch.
+        and the shape is [out_batch, out_depth, out_row, out_col], the out_batch is the same as the in_batch.

    Examples:
        >>> net = Unfold(ksizes=[1, 2, 2, 1], strides=[1, 1, 1, 1], rates=[1, 1, 1, 1])
@ -556,11 +556,11 @@ class MatrixDiag(Cell):
    Returns a batched diagonal tensor with a given batched diagonal values.

    Inputs:
-        - **x** (Tensor) - The diagonal values. It can be of the following data types:
-          float32, float16, int32, int8, uint8.
+        - **x** (Tensor) - The diagonal values. It can be one of the following data types:
+          float32, float16, int32, int8, and uint8.

    Outputs:
-        Tensor, same type as input `x`. The shape should be x.shape + (x.shape[-1], ).
+        Tensor, has the same type as input `x`. The shape should be x.shape + (x.shape[-1], ).

    Examples:
        >>> x = Tensor(np.array([1, -1]), mstype.float32)
@ -587,11 +587,11 @@ class MatrixDiagPart(Cell):
    Returns the batched diagonal part of a batched tensor.

    Inputs:
-        - **x** (Tensor) - The batched tensor. It can be of the following data types:
-          float32, float16, int32, int8, uint8.
+        - **x** (Tensor) - The batched tensor. It can be one of the following data types:
+          float32, float16, int32, int8, and uint8.

    Outputs:
-        Tensor, same type as input `x`. The shape should be x.shape[:-2] + [min(x.shape[-2:])].
+        Tensor, has the same type as input `x`. The shape should be x.shape[:-2] + [min(x.shape[-2:])].

    Examples:
        >>> x = Tensor([[[-1, 0], [0, 1]], [[-1, 0], [0, 1]], [[-1, 0], [0, 1]]], mindspore.float32)
@ -617,12 +617,12 @@ class MatrixSetDiag(Cell):
    Modify the batched diagonal part of a batched tensor.

    Inputs:
-        - **x** (Tensor) - The batched tensor. It can be of the following data types:
-          float32, float16, int32, int8, uint8.
+        - **x** (Tensor) - The batched tensor. It can be one of the following data types:
+          float32, float16, int32, int8, and uint8.
        - **diagonal** (Tensor) - The diagonal values.

    Outputs:
-        Tensor, same type as input `x`. The shape same as `x`.
+        Tensor, has the same type and shape as input `x`.

    Examples:
        >>> x = Tensor([[[-1, 0], [0, 1]], [[-1, 0], [0, 1]], [[-1, 0], [0, 1]]], mindspore.float32)
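
Reviewer note: the output shapes documented for `MatrixDiag`, `MatrixDiagPart` and `MatrixSetDiag` can be illustrated with nested Python lists. This is a minimal sketch under the assumption of a single batch dimension. The helper names are hypothetical and it does not reproduce how the MindSpore cells are implemented.

```python
def matrix_diag(batch):
    """Map a batch of vectors (shape [B, N]) to diagonal matrices (shape [B, N, N])."""
    return [
        [[value if row == col else 0.0 for col in range(len(vector))]
         for row, value in enumerate(vector)]
        for vector in batch
    ]


def matrix_diag_part(batch):
    """Map a batch of matrices (shape [B, M, N]) to their diagonals (shape [B, min(M, N)])."""
    return [
        [matrix[i][i] for i in range(min(len(matrix), len(matrix[0])))]
        for matrix in batch
    ]


def matrix_set_diag(batch, diagonals):
    """Return copies of the matrices with their main diagonals replaced."""
    result = [[list(row) for row in matrix] for matrix in batch]
    for matrix, diagonal in zip(result, diagonals):
        for i, value in enumerate(diagonal):
            matrix[i][i] = value
    return result


print(matrix_diag([[1.0, -1.0]]))  # [[[1.0, 0.0], [0.0, -1.0]]]
```

The input of shape `[1, 2]` becomes an output of shape `[1, 2, 2]`, matching the documented `x.shape + (x.shape[-1], )`.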
--- a/mindspore/nn/layer/container.py
+++ b/mindspore/nn/layer/container.py
@ -72,7 +72,7 @@ class SequentialCell(Cell):
        args (list, OrderedDict): List of subclass of Cell.

    Raises:
-        TypeError: If arg is not of type list or OrderedDict.
+        TypeError: If the type of the argument is not list or OrderedDict.

    Inputs:
        - **input** (Tensor) - Tensor with shape according to the first Cell in the sequence.
--- a/mindspore/nn/layer/conv.py
+++ b/mindspore/nn/layer/conv.py
@ -131,7 +131,7 @@ class Conv2d(_Conv):
    Args:
        in_channels (int): The number of input channel :math:`C_{in}`.
        out_channels (int): The number of output channel :math:`C_{out}`.
-        kernel_size (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the height
+        kernel_size (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the height
            and width of the 2D convolution window. Single int means the value is for both the height and the width of
            the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
            width of the kernel.
@ -147,7 +147,7 @@ class Conv2d(_Conv):
              last extra padding will be done from the bottom and the right side. If this mode is set, `padding`
              must be 0.

-            - valid: Adopts the way of discarding. The possibly largest height and width of output will be returned
+            - valid: Adopts the way of discarding. The largest possible height and width of output will be returned
              without padding. Extra pixels will be discarded. If this mode is set, `padding`
              must be 0.

@ -158,7 +158,7 @@ class Conv2d(_Conv):
                    the padding of top, bottom, left and right is the same, equal to padding. If `padding` is a tuple
                    with four integers, the padding of top, bottom, left and right will be equal to padding[0],
                    padding[1], padding[2], and padding[3] accordingly. Default: 0.
-        dilation (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the dilation rate
+        dilation (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the dilation rate
                                      to use for dilated convolution. If set to be :math:`k > 1`, there will
                                      be :math:`k - 1` pixels skipped for each sampling location. Its value should
                                      be greater or equal to 1 and bounded by the height and width of the
@ -451,7 +451,7 @@ class Conv2dTranspose(_Conv):
    Args:
        in_channels (int): The number of channels in the input space.
        out_channels (int): The number of channels in the output space.
-        kernel_size (Union[int, tuple]): int or tuple with 2 integers, which specifies the  height
+        kernel_size (Union[int, tuple]): int or a tuple of 2 integers, which specifies the  height
            and width of the 2D convolution window. Single int means the value is for both the height and the width of
            the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
            width of the kernel.
@ -825,7 +825,7 @@ class DepthwiseConv2d(Cell):
    Args:
        in_channels (int): The number of input channel :math:`C_{in}`.
        out_channels (int): The number of output channel :math:`C_{out}`.
-        kernel_size (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the height
+        kernel_size (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the height
            and width of the 2D convolution window. Single int means the value is for both the height and the width of
            the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
            width of the kernel.
@ -841,7 +841,7 @@ class DepthwiseConv2d(Cell):
              last extra padding will be done from the bottom and the right side. If this mode is set, `padding`
              must be 0.

-            - valid: Adopts the way of discarding. The possibly largest height and width of output will be returned
+            - valid: Adopts the way of discarding. The largest possible height and width of output will be returned
              without padding. Extra pixels will be discarded. If this mode is set, `padding`
              must be 0.

@ -849,16 +849,16 @@ class DepthwiseConv2d(Cell):
              Tensor borders. `padding` should be greater than or equal to 0.

        padding (int): Implicit paddings on both sides of the input. Default: 0.
-        dilation (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the dilation rate
+        dilation (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the dilation rate
                                      to use for dilated convolution. If set to be :math:`k > 1`, there will
                                      be :math:`k - 1` pixels skipped for each sampling location. Its value should
-                                      be greater or equal to 1 and bounded by the height and width of the
+                                      be greater than or equal to 1 and bounded by the height and width of the
                                      input. Default: 1.
        group (int): Split filter into groups, `in_ channels` and `out_channels` should be
            divisible by the number of groups. Default: 1.
        has_bias (bool): Specifies whether the layer uses a bias vector. Default: False.
        weight_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the convolution kernel.
-            It can be a Tensor, a string, an Initializer or a numbers.Number. When a string is specified,
+            It can be a Tensor, a string, an Initializer or a number. When a string is specified,
            values from 'TruncatedNormal', 'Normal', 'Uniform', 'HeUniform' and 'XavierUniform' distributions as well
            as constant 'One' and 'Zero' distributions are possible. Alias 'xavier_uniform', 'he_uniform', 'ones'
            and 'zeros' are acceptable. Uppercase and lowercase are both acceptable. Refer to the values of
--- a/mindspore/nn/layer/embedding.py
+++ b/mindspore/nn/layer/embedding.py
@ -36,7 +36,7 @@ class Embedding(Cell):
    the corresponding word embeddings.

    Note:
-        When 'use_one_hot' is set to True, the input should be of type mindspore.int32.
+        When 'use_one_hot' is set to True, the type of the input should be mindspore.int32.

    Args:
        vocab_size (int): Size of the dictionary of embeddings.
@ -48,9 +48,9 @@ class Embedding(Cell):
        dtype (:class:`mindspore.dtype`): Data type of input. Default: mindspore.float32.

    Inputs:
-        - **input** (Tensor) - Tensor of shape :math:`(\text{batch_size}, \text{input_length})`. The element of
-          the Tensor should be integer and not larger than vocab_size. else the corresponding embedding vector is zero
-          if larger than vocab_size.
+        - **input** (Tensor) - Tensor of shape :math:`(\text{batch_size}, \text{input_length})`. The elements of
+          the Tensor should be integers and not larger than vocab_size. Otherwise, the corresponding embedding vector
+          will be zero.

    Outputs:
        Tensor of shape :math:`(\text{batch_size}, \text{input_length}, \text{embedding_size})`.
--- a/mindspore/nn/layer/image.py
+++ b/mindspore/nn/layer/image.py
@ -253,7 +253,7 @@ class MSSSIM(Cell):
    Args:
        max_val (Union[int, float]): The dynamic range of the pixel values (255 for 8-bit grayscale images).
          Default: 1.0.
-        power_factors (Union[tuple, list]): Iterable of weights for each of the scales.
+        power_factors (Union[tuple, list]): Iterable of weights for each scale.
          Default: (0.0448, 0.2856, 0.3001, 0.2363, 0.1333). Default values obtained by Wang et al.
        filter_size (int): The size of the Gaussian filter. Default: 11.
        filter_sigma (float): The standard deviation of Gaussian kernel. Default: 1.5.
--- a/mindspore/nn/layer/lstm.py
+++ b/mindspore/nn/layer/lstm.py
@ -35,7 +35,7 @@ class LSTM(Cell):
    Applies a LSTM to the input.

    There are two pipelines connecting two consecutive cells in a LSTM model; one is cell state pipeline
-    and another is hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
+    and the other is hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
    Given an input :math:`x_t` at time :math:`t`, an hidden state :math:`h_{t-1}` and an cell
    state :math:`c_{t-1}` of the layer at time :math:`{t-1}`, the cell state and hidden state at
    time :math:`t` is computed using an gating mechanism. Input gate :math:`i_t` is designed to protect the cell
@ -68,18 +68,17 @@ class LSTM(Cell):
        input_size (int): Number of features of input.
        hidden_size (int):  Number of features of hidden layer.
        num_layers (int): Number of layers of stacked LSTM . Default: 1.
-        has_bias (bool): Specifies whether has bias `b_ih` and `b_hh`. Default: True.
+        has_bias (bool): Whether the cell has bias `b_ih` and `b_hh`. Default: True.
        batch_first (bool): Specifies whether the first dimension of input is batch_size. Default: False.
        dropout (float, int): If not 0, append `Dropout` layer on the outputs of each
            LSTM layer except the last layer. Default 0. The range of dropout is [0.0, 1.0].
-        bidirectional (bool): Specifies whether this is a bidirectional LSTM. If set True,
-            number of directions will be 2 otherwise number of directions is 1. Default: False.
+        bidirectional (bool): Specifies whether it is a bidirectional LSTM. Default: False.

    Inputs:
        - **input** (Tensor) - Tensor of shape (seq_len, batch_size, `input_size`).
        - **hx** (tuple) - A tuple of two Tensors (h_0, c_0) both of data type mindspore.float32 or
          mindspore.float16 and shape (num_directions * `num_layers`, batch_size, `hidden_size`).
-          Data type of `hx` should be the same of `input`.
+          Data type of `hx` should be the same as `input`.

    Outputs:
        Tuple, a tuple constains (`output`, (`h_n`, `c_n`)).
@ -205,7 +204,7 @@ class LSTMCell(Cell):
    Applies a LSTM layer to the input.

    There are two pipelines connecting two consecutive cells in a LSTM model; one is cell state pipeline
-    and another is hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
+    and the other is hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
    Given an input :math:`x_t` at time :math:`t`, a hidden state :math:`h_{t-1}` and a cell
    state :math:`c_{t-1}` of the layer at time :math:`{t-1}`, the cell state and hidden state at
    time :math:`t` are computed using a gating mechanism. Input gate :math:`i_t` is designed to protect the cell
@ -238,7 +237,7 @@ class LSTMCell(Cell):
        input_size (int): Number of features of input.
        hidden_size (int):  Number of features of hidden layer.
        layer_index (int): Index of the current layer of the stacked LSTM. Default: 0.
-        has_bias (bool): Specifies whether has bias `b_ih` and `b_hh`. Default: True.
+        has_bias (bool): Whether the cell has bias `b_ih` and `b_hh`. Default: True.
        batch_first (bool): Specifies whether the first dimension of input is batch_size. Default: False.
        dropout (float, int): If not 0, append `Dropout` layer on the outputs of each
            LSTM layer except the last layer. Default: 0. The range of dropout is [0.0, 1.0].
--- a/mindspore/nn/layer/normalization.py
+++ b/mindspore/nn/layer/normalization.py
@ -243,6 +243,10 @@ class BatchNorm1d(_BatchNorm):
    .. math::
        y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta

+    Note:
+        The implementation of BatchNorm differs between graph mode and pynative mode, so changing the mode
+        after the net is initialized is not recommended.
+
    Args:
        num_features (int): `C` from an expected input of size (N, C).
        eps (float): A value added to the denominator for numerical stability. Default: 1e-5.
@ -319,6 +323,10 @@ class BatchNorm2d(_BatchNorm):
    .. math::
        y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta

+    Note:
+        The implementation of BatchNorm differs between graph mode and pynative mode, so the mode cannot be
+        changed after the net is initialized.
+
    Args:
        num_features (int): `C` from an expected input of size (N, C, H, W).
        eps (float): A value added to the denominator for numerical stability. Default: 1e-5.
@ -384,8 +392,8 @@ class GlobalBatchNorm(_BatchNorm):
    r"""
    Global normalization layer over a N-dimension input.

-    Global Normalization is cross device synchronized batch normalization. Batch Normalization implementation
-    only normalize the data within each device. Global normalization will normalize the input within the group.
+    Global Normalization is cross-device synchronized batch normalization. The implementation of Batch Normalization
+    only normalizes the data within each device. Global normalization will normalize the input within the group.
    It has been described in the paper `Batch Normalization: Accelerating Deep Network Training by
    Reducing Internal Covariate Shift <https://arxiv.org/abs/1502.03167>`_. It rescales and recenters the
    feature using a mini-batch of data and the learned parameters which can be described in the following formula.
@ -467,10 +475,10 @@ class LayerNorm(Cell):
    Applies Layer Normalization over a mini-batch of inputs.

    Layer normalization is widely used in recurrent neural networks. It applies
-    normalization over a mini-batch of inputs for each single training case as described
+    normalization on a mini-batch of inputs for each single training case as described
    in the paper `Layer Normalization <https://arxiv.org/pdf/1607.06450.pdf>`_. Unlike batch
    normalization, layer normalization performs exactly the same computation at training and
-    testing times. It can be described using the following formula. It is applied across all channels
+    testing time. It can be described using the following formula. It is applied across all channels
    and pixels but only within a single sample of the batch.

    .. math::
@ -545,7 +553,7 @@ class GroupNorm(Cell):
    Group Normalization over a mini-batch of inputs.

    Group normalization is widely used in recurrent neural networks. It applies
-    normalization over a mini-batch of inputs for each single training case as described
+    normalization on a mini-batch of inputs for each single training case as described
    in the paper `Group Normalization <https://arxiv.org/pdf/1803.08494.pdf>`_. Group normalization
    divides the channels into groups and computes within each group the mean and variance for normalization,
    and its performance is very stable over a wide range of batch sizes. It can be described using the following formula.
@ -557,7 +565,7 @@ class GroupNorm(Cell):
        num_groups (int): The number of groups to be divided along the channel dimension.
        num_channels (int): The number of channels per group.
        eps (float): A value added to the denominator for numerical stability. Default: 1e-5.
-        affine (bool): A bool value, this layer will has learnable affine parameters when set to true. Default: True.
+        affine (bool): A bool value. This layer will have learnable affine parameters when set to true. Default: True.
        gamma_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the gamma weight.
            The values of str refer to the function `initializer` including 'zeros', 'ones', 'xavier_uniform',
            'he_uniform', etc. Default: 'ones'.
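Illustrative note on the normalization formula quoted in the BatchNorm docstrings above, y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta. The following is a minimal pure-Python sketch for an input of shape (N, C), not MindSpore's implementation. The function name `batch_norm_1d` and the use of the biased (population) variance are assumptions made for illustration.

```python
import math


def batch_norm_1d(batch, gamma, beta, eps=1e-5):
    """Normalize a batch of shape (N, C) feature by feature (hypothetical helper)."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for c in range(num_features):
        column = [row[c] for row in batch]
        mean = sum(column) / n
        # Biased variance over the batch dimension (an assumption of this sketch).
        var = sum((v - mean) ** 2 for v in column) / n
        denom = math.sqrt(var + eps)
        for i, v in enumerate(column):
            out[i][c] = (v - mean) / denom * gamma[c] + beta[c]
    return out


result = batch_norm_1d([[1.0, 2.0], [3.0, 2.0]], gamma=[1.0, 1.0], beta=[0.0, 0.5])
```

A feature with zero variance collapses to its `beta` value, which is why `eps` is needed in the denominator.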
--- a/mindspore/nn/layer/quant.py
+++ b/mindspore/nn/layer/quant.py
@ -61,7 +61,7 @@ class Conv2dBnAct(Cell):
    Args:
        in_channels (int): The number of input channels :math:`C_{in}`.
        out_channels (int): The number of output channels :math:`C_{out}`.
-        kernel_size (Union[int, tuple]): The data type is int or tuple with 2 integers. Specifies the height
+        kernel_size (Union[int, tuple]): The data type is int or a tuple of 2 integers. Specifies the height
            and width of the 2D convolution window. Single int means the value is for both height and width of
            the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
            width of the kernel.
@ -292,19 +292,19 @@ class BatchNormFoldCell(Cell):

 class FakeQuantWithMinMax(Cell):
    r"""
-    Quantization aware op. This OP provide Fake quantization observer function on data with min and max.
+    Quantization aware op. This OP provides the fake quantization observer function on data with min and max.

    Args:
        min_init (int, float): The dimension of channel or 1(layer). Default: -6.
        max_init (int, float): The dimension of channel or 1(layer). Default: 6.
-        ema (bool): Exponential Moving Average algorithm update min and max. Default: False.
+        ema (bool): Whether the Exponential Moving Average algorithm updates min and max. Default: False.
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
        channel_axis (int): Quantization by channel axis. Default: 1.
        num_channels (int): Declares the min and max channel size. Default: 1.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
-        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
-        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
+        symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False.
+        narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.

    Inputs:
@ -431,7 +431,7 @@ class Conv2dBnFoldQuant(Cell):
            variance vector. Default: 'ones'.
        fake (bool): Whether Conv2dBnFoldQuant Cell adds FakeQuantWithMinMax op. Default: True.
        per_channel (bool): FakeQuantWithMinMax Parameters. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): The Quantization delay parameters according to the global step. Default: 0.
@ -614,7 +614,7 @@ class Conv2dBnWithoutFoldQuant(Cell):
            Default: 'normal'.
        bias_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the bias vector. Default: 'zeros'.
        per_channel (bool): FakeQuantWithMinMax Parameters. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@ -736,7 +736,7 @@ class Conv2dQuant(Cell):
            Default: 'normal'.
        bias_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the bias vector. Default: 'zeros'.
        per_channel (bool): FakeQuantWithMinMax Parameters. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@ -845,7 +845,7 @@ class DenseQuant(Cell):
        has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.
        activation (str): The regularization function applied to the output of the layer, e.g., 'relu'. Default: None.
        per_channel (bool): FakeQuantWithMinMax Parameters. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@ -947,15 +947,14 @@ class ActQuant(_QuantActivation):
    r"""
    Quantization aware training activation function.

-    Add Fake Quant OP after activation. Not Recommand to used these cell for Fake Quant Op
-    Will climp the max range of the activation and the relu6 do the same operation.
-    This part is a more detailed overview of ReLU6 op.
+    Add the fake quant op to the end of activation op, by which the output of activation op will be truncated.
+    Please check `FakeQuantWithMinMax` for more details.

    Args:
        activation (Cell): Activation cell class.
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global steps. Default: 0.
@ -1010,7 +1009,7 @@ class LeakyReLUQuant(_QuantActivation):
        activation (Cell): Activation cell class.
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@ -1080,9 +1079,9 @@ class HSwishQuant(_QuantActivation):
        activation (Cell): Activation cell class.
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
-        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
-        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
+        symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False.
+        narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.

    Inputs:
@ -1149,9 +1148,9 @@ class HSigmoidQuant(_QuantActivation):
        activation (Cell): Activation cell class.
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
-        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
-        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
+        symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False.
+        narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.

    Inputs:
@ -1217,7 +1216,7 @@ class TensorAddQuant(Cell):
    Args:
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@ -1269,7 +1268,7 @@ class MulQuant(Cell):
    Args:
        ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
        per_channel (bool):  Quantization granularity based on layer or on channel. Default: False.
-        num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
+        num_bits (int): The number of bits for quantization, supporting 4 and 8 bits. Default: 8.
        symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
        narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
        quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
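Illustrative note on the `num_bits`, `narrow_range`, and min/max parameters documented in the quant.py hunks above. The sketch below shows a generic, simplified min/max fake quantization (clamp, round to an integer level, map back to float). It is not MindSpore's kernel: the function name `fake_quant` is hypothetical, the `symmetric` option and zero-point nudging are deliberately left out, and the defaults of -6 and 6 only mirror the `min_init` and `max_init` defaults stated in the docstring.

```python
def fake_quant(x, min_val=-6.0, max_val=6.0, num_bits=8, narrow_range=False):
    """Quantize x to an integer level and map it back to float (hypothetical helper)."""
    # With narrow_range, the lowest integer level is skipped (assumption of this sketch).
    quant_min = 1 if narrow_range else 0
    quant_max = 2 ** num_bits - 1
    scale = (max_val - min_val) / (quant_max - quant_min)
    # Values outside [min_val, max_val] are truncated to the range.
    clamped = min(max(x, min_val), max_val)
    level = round((clamped - min_val) / scale)  # integer step index from min_val
    return min_val + level * scale


truncated = fake_quant(10.0)             # clamped to the upper bound
coarse = fake_quant(0.3, num_bits=4)     # 4-bit grid has a step of 0.8
```

With fewer bits the grid becomes coarser, so the rounding error per value grows to at most half a step.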
--- a/mindspore/nn/loss/loss.py
+++ b/mindspore/nn/loss/loss.py
@ -80,7 +80,7 @@ class L1Loss(_Loss):
    When argument reduction is 'sum', the sum of :math:`L(x, y)` will be returned. :math:`N` is the batch size.

    Args:
-        reduction (str): Type of reduction to apply to loss. The optional values are "mean", "sum", "none".
+        reduction (str): Type of reduction to be applied to loss. The optional values are "mean", "sum", and "none".
            Default: "mean".

    Inputs:
@ -107,7 +107,7 @@ class L1Loss(_Loss):

 class MSELoss(_Loss):
    r"""
-    MSELoss create a criterion to measures the mean squared error (squared L2-norm) between :math:`x` and :math:`y`
+    MSELoss creates a criterion to measure the mean squared error (squared L2-norm) between :math:`x` and :math:`y`
    by element, where :math:`x` is the input and :math:`y` is the target.

    For simplicity, let :math:`x` and :math:`y` be 1-dimensional Tensor with length :math:`N`,
@ -120,7 +120,7 @@ class MSELoss(_Loss):
    When argument reduction is 'sum', the sum of :math:`L(x, y)` will be returned. :math:`N` is the batch size.

    Args:
-        reduction (str): Type of reduction to apply to loss. The optional values are "mean", "sum", "none".
+        reduction (str): Type of reduction to be applied to loss. The optional values are "mean", "sum", and "none".
            Default: "mean".

    Inputs:
@ -210,14 +210,14 @@ class SoftmaxCrossEntropyWithLogits(_Loss):

    Note:
        While the target classes are mutually exclusive, i.e., only one class is positive in the target, the predicted
-        probabilities need not be exclusive. All that is required is that the predicted probability distribution
+        probabilities need not be exclusive. It is only required that the predicted probability distribution
        of entry is a valid one.

    Args:
        is_grad (bool): Specifies whether to calculate grad only. Default: True.
        sparse (bool): Specifies whether labels use sparse format or not. Default: False.
-        reduction (Union[str, None]): Type of reduction to apply to loss. Support 'sum' or 'mean' If None,
-            do not reduction. Default: None.
+        reduction (Union[str, None]): Type of reduction to be applied to loss. Supports 'sum' and 'mean'. If None,
+            do not perform reduction. Default: None.
        smooth_factor (float): Label smoothing factor. It is an optional input which should be in range [0, 1].
            Default: 0.
        num_classes (int): The number of classes in the task. It is an optional input. Default: 2.
@ -225,7 +225,7 @@ class SoftmaxCrossEntropyWithLogits(_Loss):
    Inputs:
        - **logits** (Tensor) - Tensor of shape (N, C).
        - **labels** (Tensor) - Tensor of shape (N, ). If `sparse` is True, the type of
-          `labels` is mindspore.int32. If `sparse` is False, the type of `labels` is same as the type of `logits`.
+          `labels` is mindspore.int32. If `sparse` is False, the type of `labels` is the same as the type of `logits`.

    Outputs:
        Tensor, a tensor of the same shape as logits with the component-wise
@ -282,8 +282,8 @@ class SoftmaxCrossEntropyExpand(Cell):
    where :math:`x_i` is a 1D score Tensor, :math:`t_i` is the target class.

    Note:
-        When argument sparse is set to True, the format of label is the index
-        range from :math:`0` to :math:`C - 1` instead of one-hot vectors.
+        When argument sparse is set to True, the format of the label is the index
+        ranging from :math:`0` to :math:`C - 1` instead of one-hot vectors.

    Args:
        sparse(bool): Specifies whether labels use sparse format or not. Default: False.
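Illustrative note on the `reduction` argument documented for L1Loss and MSELoss above ("mean", "sum", and "none"). This is a minimal pure-Python sketch over flat lists, not the MindSpore loss cells. The function name `l1_loss` is a hypothetical stand-in.

```python
def l1_loss(inputs, targets, reduction="mean"):
    """Element-wise absolute error with an optional reduction (hypothetical helper)."""
    losses = [abs(x - y) for x, y in zip(inputs, targets)]
    if reduction == "none":
        return losses                     # keep one value per element
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)  # sum divided by N
    raise ValueError(f"unsupported reduction: {reduction!r}")


per_element = l1_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], reduction="none")
total = l1_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], reduction="sum")
average = l1_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```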
--- a/mindspore/nn/metrics/init.py
+++ b/mindspore/nn/metrics/init.py
@ -69,7 +69,7 @@ def names():

 def get_metric_fn(name, *args, **kwargs):
    """
-    Gets the metric method base on the input name.
+    Gets the metric method based on the input name.

    Args:
        name (str): The name of metric method. Refer to the '__factory__'
--- a/mindspore/nn/metrics/metric.py
+++ b/mindspore/nn/metrics/metric.py
@ -82,7 +82,7 @@ class Metric(metaclass=ABCMeta):
    @abstractmethod
    def clear(self):
        """
-        A interface describes the behavior of clearing the internal evaluation result.
+        An interface that describes the behavior of clearing the internal evaluation result.

        Note:
            All subclasses should override this interface.
@ -92,7 +92,7 @@ class Metric(metaclass=ABCMeta):
    @abstractmethod
    def eval(self):
        """
-        A interface describes the behavior of computing the evaluation result.
+        An interface that describes the behavior of computing the evaluation result.

        Note:
            All subclasses should override this interface.
@ -102,7 +102,7 @@ class Metric(metaclass=ABCMeta):
    @abstractmethod
    def update(self, *inputs):
        """
-        A interface describes the behavior of updating the internal evaluation result.
+        An interface that describes the behavior of updating the internal evaluation result.

        Note:
            All subclasses should override this interface.
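Illustrative note on the `clear`, `eval`, and `update` contract described in the Metric docstrings above. The sketch below uses a stand-in abstract base class rather than importing MindSpore. `MetricSketch` and `SimpleAccuracy` are hypothetical names, and the accuracy logic is only an example of a subclass overriding all three methods.

```python
from abc import ABCMeta, abstractmethod


class MetricSketch(metaclass=ABCMeta):
    """Stand-in for the abstract base class (hypothetical, not MindSpore's class)."""

    @abstractmethod
    def clear(self):
        """Reset the internal evaluation result."""

    @abstractmethod
    def eval(self):
        """Compute the evaluation result."""

    @abstractmethod
    def update(self, *inputs):
        """Accumulate the internal evaluation result."""


class SimpleAccuracy(MetricSketch):
    def __init__(self):
        self.clear()

    def clear(self):
        self._correct = 0
        self._total = 0

    def update(self, *inputs):
        predictions, labels = inputs
        for predicted, label in zip(predictions, labels):
            self._correct += int(predicted == label)
            self._total += 1

    def eval(self):
        if self._total == 0:
            raise RuntimeError("update() must be called before eval()")
        return self._correct / self._total


metric = SimpleAccuracy()
metric.update([1, 0, 1], [1, 1, 1])
accuracy = metric.eval()
```

Because all three methods are abstract, a subclass that omits any of them cannot be instantiated.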
--- a/mindspore/nn/optim/adam.py
+++ b/mindspore/nn/optim/adam.py
@ -36,8 +36,8 @@ def _update_run_op(beta1, beta2, eps, lr, weight_decay, param, m, v, gradient, d
    Update parameters.

    Args:
-        beta1 (Tensor): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0).
-        beta2 (Tensor): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0).
+        beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
+        beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
        eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
        lr (Tensor): Learning rate.
        weight_decay (Number): Weight decay. Should be equal to or greater than 0.
@ -180,12 +180,12 @@ class Adam(Optimizer):
              the order will be followed in the optimizer. There are no other keys in the `dict` and the parameters
              which are in 'order_params' should be in one of the group parameters.

-        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate.
-            When the learning_rate is a Iterable or a Tensor with dimension of 1, use the dynamic learning rate, then
+        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
+            When the learning_rate is an Iterable or a 1-D Tensor, use the dynamic learning rate, then
            the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
            use dynamic learning rate, the i-th learning rate will be calculated during the process of training
-            according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
-            dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
+            according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor,
+            use fixed learning rate. Other cases are not supported. The float learning rate should be
            equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
            Default: 1e-3.
        beta1 (float): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
@ -195,11 +195,11 @@ class Adam(Optimizer):
        eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
                     1e-8.
        use_locking (bool): Whether to enable a lock to protect updating variable tensors.
-            If True, updating of the var, m, and v tensors will be protected by a lock.
-            If False, the result is unpredictable. Default: False.
+            If true, updates of the var, m, and v tensors will be protected by a lock.
+            If false, the result is unpredictable. Default: False.
        use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
-            If True, update the gradients using NAG.
-            If False, update the gradients without using NAG. Default: False.
+            If true, update the gradients using NAG.
+            If false, update the gradients without using NAG. Default: False.
        weight_decay (float): Weight decay (L2 penalty). It should be equal to or greater than 0. Default: 0.0.
        loss_scale (float): A floating point value for the loss scale. Should be greater than 0. Default: 1.0.

@ -304,12 +304,12 @@ class AdamWeightDecay(Optimizer):
              the order will be followed in the optimizer. There are no other keys in the `dict` and the parameters
              which are in 'order_params' should be in one of the group parameters.

-        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate.
-            When the learning_rate is a Iterable or a Tensor with dimension of 1, use the dynamic learning rate, then
+        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
+            When the learning_rate is an Iterable or a 1-D Tensor, use the dynamic learning rate, then
            the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
            use dynamic learning rate, the i-th learning rate will be calculated during the process of training
-            according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
-            dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
+            according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor,
+            use fixed learning rate. Other cases are not supported. The float learning rate should be
            equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
            Default: 1e-3.
        beta1 (float): The exponential decay rate for the 1st moment estimations. Default: 0.9.
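Illustrative note on the `learning_rate` rules described in the Adam and AdamWeightDecay docstrings above: a float gives a fixed rate, an int is converted to float, an Iterable supplies the i-th value at the i-th step, and a schedule computes the rate from the step. The sketch below is pure Python and not the optimizer's own code. `lr_at_step` is a hypothetical helper, a plain callable stands in for `LearningRateSchedule`, and Tensor inputs are not modelled.

```python
from collections.abc import Iterable


def lr_at_step(learning_rate, step):
    """Resolve the learning rate used at a given step (hypothetical helper)."""
    if isinstance(learning_rate, bool):
        raise TypeError("bool is not a supported learning rate")
    if isinstance(learning_rate, int):
        learning_rate = float(learning_rate)  # int is converted to float
    if isinstance(learning_rate, float):
        if learning_rate < 0:
            raise ValueError("learning rate should be equal to or greater than 0")
        return learning_rate                  # fixed learning rate
    if callable(learning_rate):
        return learning_rate(step)            # stand-in for a schedule formula
    if isinstance(learning_rate, Iterable):
        return list(learning_rate)[step]      # the i-th step takes the i-th value
    raise TypeError("unsupported learning rate type")


fixed = lr_at_step(0.001, step=7)
dynamic = lr_at_step([0.1, 0.05, 0.01], step=2)
scheduled = lr_at_step(lambda s: 0.1 / (s + 1), step=1)
```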
--- a/mindspore/nn/optim/ftrl.py
+++ b/mindspore/nn/optim/ftrl.py
@ -114,12 +114,12 @@ class FTRL(Optimizer):
            than or equal to zero. Use fixed learning rate if lr_power is zero. Default: -0.5.
        l1 (float): l1 regularization strength, must be greater than or equal to zero. Default: 0.0.
        l2 (float): l2 regularization strength, must be greater than or equal to zero. Default: 0.0.
-        use_locking (bool): If True use locks for update operation. Default: False.
+        use_locking (bool): If True, use locks for updating operation. Default: False.
        loss_scale (float): Value for the loss scale. It should be equal to or greater than 1.0. Default: 1.0.
        weight_decay (float): Weight decay value to multiply weight, must be zero or positive value. Default: 0.0.

    Inputs:
-        - **grads** (tuple[Tensor]) - The gradients of `params` in optimizer, the shape is as same as the `params`
+        - **grads** (tuple[Tensor]) - The gradients of `params` in the optimizer, the shape is the same as the `params`
          in the optimizer.

    Outputs:
--- a/mindspore/nn/optim/lamb.py
+++ b/mindspore/nn/optim/lamb.py
@ -39,8 +39,8 @@ def _update_run_op(beta1, beta2, eps, global_step, lr, weight_decay, param, m, v
    Update parameters.

    Args:
-        beta1 (Tensor): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0).
-        beta2 (Tensor): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0).
+        beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
+        beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
        eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
        lr (Tensor): Learning rate.
        weight_decay (Number): Weight decay. Should be equal to or greater than 0.
@ -122,8 +122,8 @@ def _update_run_op_graph_kernel(beta1, beta2, eps, global_step, lr, weight_decay
    Update parameters.

    Args:
-        beta1 (Tensor): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0).
-        beta2 (Tensor): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0).
+        beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
+        beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
        eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
        lr (Tensor): Learning rate.
        weight_decay (Number): Weight decay. Should be equal to or greater than 0.
@ -184,7 +184,7 @@ def _check_param_value(beta1, beta2, eps, prim_name):

 class Lamb(Optimizer):
    """
-    Lamb Dynamic LR.
+    Lamb Dynamic Learning Rate.

    LAMB is an optimization algorithm employing a layerwise adaptive large batch
    optimization technique. Refer to the paper `LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76
@ -214,16 +214,16 @@ class Lamb(Optimizer):
              the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which
              are in the value of 'order_params' should be in one of the group parameters.

-        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate.
-            When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then
+        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
+            When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then
            the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
            use dynamic learning rate, the i-th learning rate will be calculated during the process of training
-            according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
-            dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
+            according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor,
+            use fixed learning rate. Other cases are not supported. The float learning rate should be
            equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
-        beta1 (float): The exponential decay rate for the 1st moment estimates. Default: 0.9.
+        beta1 (float): The exponential decay rate for the 1st moment estimations. Default: 0.9.
            Should be in range (0.0, 1.0).
-        beta2 (float): The exponential decay rate for the 2nd moment estimates. Default: 0.999.
+        beta2 (float): The exponential decay rate for the 2nd moment estimations. Default: 0.999.
            Should be in range (0.0, 1.0).
        eps (float): Term added to the denominator to improve numerical stability. Default: 1e-6.
            Should be greater than 0.
--- a/mindspore/nn/optim/lars.py
+++ b/mindspore/nn/optim/lars.py
@ -58,12 +58,12 @@ class LARS(Optimizer):
        epsilon (float): Term added to the denominator to improve numerical stability. Default: 1e-05.
        coefficient (float): Trust coefficient for calculating the local learning rate. Default: 0.001.
        use_clip (bool): Whether to use clip operation for calculating the local learning rate. Default: False.
-        lars_filter (Function): A function to determine whether apply lars algorithm. Default:
+        lars_filter (Function): A function to determine whether to apply the LARS algorithm. Default:
                                lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name.

    Inputs:
-        - **gradients** (tuple[Tensor]) - The gradients of `params` in optimizer, the shape is
-          as same as the `params` in optimizer.
+        - **gradients** (tuple[Tensor]) - The gradients of `params` in the optimizer; the shape is the
+          same as that of the `params` in the optimizer.

    Outputs:
        Union[Tensor[bool], tuple[Parameter]], it depends on the output of `optimizer`.
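
As a small illustration of the default `lars_filter` quoted above, the sketch below applies the same predicate to a few stand-in parameters. `Param` and the parameter names are hypothetical; the only thing taken from the docstring is the predicate on `x.name`.

```python
from collections import namedtuple

# Hypothetical stand-in for a parameter object; only `.name` is needed here.
Param = namedtuple("Param", ["name"])


def default_lars_filter(x):
    """Same predicate as the documented default lambda."""
    return 'LayerNorm' not in x.name and 'bias' not in x.name


params = [
    Param("conv1.weight"),
    Param("conv1.bias"),
    Param("LayerNorm.gamma"),
    Param("fc.weight"),
]

# Names of the parameters the filter selects for the LARS algorithm.
lars_params = [p.name for p in params if default_lars_filter(p)]
```

With these names, only the two `weight` parameters pass the filter; the bias and the LayerNorm parameter are excluded.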
--- a/mindspore/nn/optim/lazyadam.py
+++ b/mindspore/nn/optim/lazyadam.py
@ -127,26 +127,26 @@ class LazyAdam(Optimizer):
              the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which
              in the value of 'order_params' should be in one of group parameters.

-        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate.
-            When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then
+        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
+            When the learning_rate is an Iterable or a 1-D Tensor, use a dynamic learning rate, and
            the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
            use dynamic learning rate, the i-th learning rate will be calculated during the process of training
-            according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
-            dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
+            according to the formula of LearningRateSchedule. When the learning_rate is a float or a
+            zero-dimensional Tensor, use a fixed learning rate. Other cases are not supported. The float learning rate should be
            equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
            Default: 1e-3.
-        beta1 (float): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0). Default:
-                       0.9.
-        beta2 (float): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0). Default:
-                       0.999.
+        beta1 (float): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
+                       Default: 0.9.
+        beta2 (float): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
+                       Default: 0.999.
        eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
                     1e-8.
        use_locking (bool): Whether to enable a lock to protect updating variable tensors.
-            If True, updating of the var, m, and v tensors will be protected by a lock.
-            If False, the result is unpredictable. Default: False.
+            If true, updates of the var, m, and v tensors will be protected by a lock.
+            If false, the result is unpredictable. Default: False.
        use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
-            If True, updates the gradients using NAG.
-            If False, updates the gradients without using NAG. Default: False.
+            If true, update the gradients using NAG.
+            If false, update the gradients without using NAG. Default: False.
        weight_decay (float): Weight decay (L2 penalty). Default: 0.0.
        loss_scale (float): A floating point value for the loss scale. Should be equal to or greater than 1. Default:
                            1.0.
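
To show how `beta1`, `beta2`, and `eps` interact, here is a scalar sketch of one textbook Adam update. It is an assumption-laden illustration, not the LazyAdam kernel: `adam_step` is a hypothetical helper, and the lazy sparse handling, `use_locking`, `use_nesterov`, `weight_decay`, and `loss_scale` are all left out.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update on scalars; `step` is 1-based."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # eps is added to the denominator for numerical stability.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, new_m, new_v = adam_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1, lr=0.1)
```

On the first step the bias-corrected moments equal the gradient and its square, so the parameter moves by roughly `lr` regardless of the gradient's scale.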
--- a/mindspore/nn/optim/momentum.py
+++ b/mindspore/nn/optim/momentum.py
@ -83,12 +83,12 @@ class Momentum(Optimizer):
              the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which
              in the value of 'order_params' should be in one of group parameters.

-        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate.
-            When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then
+        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
+            When the learning_rate is an Iterable or a 1-D Tensor, use a dynamic learning rate, and
            the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
            use dynamic learning rate, the i-th learning rate will be calculated during the process of training
-            according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
-            dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
+            according to the formula of LearningRateSchedule. When the learning_rate is a float or a
+            zero-dimensional Tensor, use a fixed learning rate. Other cases are not supported. The float learning rate should be
            equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
        momentum (float): Hyperparameter of type float, means momentum for the moving average.
            It should be at least 0.0.
--- a/mindspore/nn/optim/optimizer.py
+++ b/mindspore/nn/optim/optimizer.py
@ -40,8 +40,6 @@ class Optimizer(Cell):
    """
    Base class for all optimizers.

-    This class defines the API to add Ops to train a model.
-
    Note:
        This class defines the API to add Ops to train a model. Never use
        this class directly, but instead instantiate one of its subclasses.
@ -55,12 +53,12 @@ class Optimizer(Cell):
        To improve parameter groups performance, the customized order of parameters can be supported.

    Args:
-        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning
-            rate. When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then
+        learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning
+            rate. When the learning_rate is an Iterable or a 1-D Tensor, use a dynamic learning rate, and
            the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
            use dynamic learning rate, the i-th learning rate will be calculated during the process of training
-            according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
-            dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
+            according to the formula of LearningRateSchedule. When the learning_rate is a float or a
+            zero-dimensional Tensor, use a fixed learning rate. Other cases are not supported. The float learning rate should be
            equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
        parameters (Union[list[Parameter], list[dict]]): When the `parameters` is a list of `Parameter` which will be
            updated, the element in `parameters` should be class `Parameter`. When the `parameters` is a list of `dict`,
@ -84,8 +82,8 @@ class Optimizer(Cell):
            type of `loss_scale` input is int, it will be converted to float. Default: 1.0.

    Raises:
-        ValueError: If the learning_rate is a Tensor, but the dims of tensor is greater than 1.
-        TypeError: If the learning_rate is not any of the three types: float, Tensor, Iterable.
+        ValueError: If the learning_rate is a Tensor, but the dimension of the tensor is greater than 1.
+        TypeError: If the learning_rate is not one of the three types: float, Tensor, or Iterable.
    """

    def __init__(self, learning_rate, parameters, weight_decay=0.0, loss_scale=1.0):
@ -179,7 +177,7 @@ class Optimizer(Cell):
        An approach to reduce the overfitting of a deep learning neural network model.

        Args:
-            gradients (tuple[Tensor]): The gradients of `self.parameters`, and have the same shape with
+            gradients (tuple[Tensor]): The gradients of `self.parameters`, with the same shape as
                `self.parameters`.

        Returns:
@ -204,7 +202,7 @@ class Optimizer(Cell):
        network.

        Args:
-            gradients (tuple[Tensor]): The gradients of `self.parameters`, and have the same shape with
+            gradients (tuple[Tensor]): The gradients of `self.parameters`, with the same shape as
                `self.parameters`.

        Returns: