Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into bi_tensor_prod_op
commit f5cb52ca3e
@ -1 +1,157 @@
./doc/howto/dev/contribute_to_paddle_en.md

# Contribute Code

We sincerely appreciate your contribution. This document explains our workflow and work style.

## Workflow

PaddlePaddle uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/). The following steps guide usual contributions.

1. Fork

   Our development community has been growing rapidly; it doesn't make sense for everyone to write into the official repo. So, please file Pull Requests from your fork. To make a fork, just head over to the GitHub page and click the ["Fork" button](https://help.github.com/articles/fork-a-repo/).

1. Clone

   To make a copy of your fork on your local computer, please run

   ```bash
   git clone https://github.com/your-github-account/paddle
   cd paddle
   ```

1. Create the local feature branch

   For daily work like adding a new feature or fixing a bug, please create a feature branch before coding:

   ```bash
   git checkout -b my-cool-stuff
   ```

1. Commit

   Before issuing your first `git commit` command, please install [`pre-commit`](http://pre-commit.com/) by running the following commands:

   ```bash
   pip install pre-commit
   pre-commit install
   ```

   Our pre-commit configuration requires clang-format 3.8 for auto-formatting C/C++ code and yapf for Python.

   Once installed, `pre-commit` checks the style of code and documentation in every commit. You will see something like the following when you run `git commit`:

   ```
   ➜ git commit
   CRLF end-lines remover...............................(no files to check)Skipped
   yapf.................................................(no files to check)Skipped
   Check for added large files..............................................Passed
   Check for merge conflicts................................................Passed
   Check for broken symlinks................................................Passed
   Detect Private Key...................................(no files to check)Skipped
   Fix End of Files.....................................(no files to check)Skipped
   clang-formater.......................................(no files to check)Skipped
   [my-cool-stuff c703c041] add test file
   1 file changed, 0 insertions(+), 0 deletions(-)
   create mode 100644 233
   ```

1. Build and test

   Users can build PaddlePaddle natively on Linux and Mac OS X. But to unify the build environment and to make debugging easier, the recommended way is [using Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/build_en.md).

1. Keep pulling

   An experienced Git user pulls from the official repo often -- daily or even hourly, so they notice conflicts with others' work early, and smaller conflicts are easier to resolve.

   ```bash
   git remote add upstream https://github.com/PaddlePaddle/Paddle
   git pull upstream develop
   ```

1. Push and file a pull request

   You can "push" your local work into your forked repo:

   ```bash
   git push origin my-cool-stuff
   ```

   The push allows you to create a pull request, requesting owners of this [official repo](https://github.com/PaddlePaddle/Paddle) to pull your change into the official one.

   To create a pull request, please follow [these steps](https://help.github.com/articles/creating-a-pull-request/).

   If your change fixes an issue, please write ["Fixes <issue-URL>"](https://help.github.com/articles/closing-issues-using-keywords/) in the description section of your pull request. GitHub will close the issue when the owners merge your pull request.

   Please remember to specify some reviewers for your pull request. If you don't know who the right ones are, please follow GitHub's recommendations.

1. Delete local and remote branches

   To keep your local workspace and your fork clean, you might want to remove merged branches:

   ```bash
   git push origin :my-cool-stuff
   git checkout develop
   git pull upstream develop
   git branch -d my-cool-stuff
   ```

### Code Review

- Please feel free to ping your reviewers by sending them the URL of your pull request via IM or email. Please do this after your pull request passes the CI.

- Please respond to every comment from your reviewers. If you follow a comment, please write "Done"; otherwise, please give a reason.

- If you don't want your reviewers to get overwhelmed by email notifications, you can reply to their comments [in a batch](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/).

- Reduce unnecessary commits. Some developers commit often. It is recommended to fold a sequence of small changes into one commit by running `git commit --amend` instead of `git commit`.

## Coding Standard

### Code Style

Our C/C++ code follows the [Google style guide](http://google.github.io/styleguide/cppguide.html).

Our Python code follows the [PEP8 style guide](https://www.python.org/dev/peps/pep-0008/).

Our build process helps to check the code style. In [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/docker/build.sh#L42), the entry point of our [builder Docker image](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/Dockerfile#L88), the CMake argument `WITH_STYLE_CHECK` is set to `ON` by default.

Please install pre-commit, which automatically reformats the changes to C/C++ and Python code whenever we run `git commit`. To check the whole codebase, we can run the command `pre-commit run -a`, as in the [`check_style.sh` file](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/travis/check_style.sh#L30), which is invoked by [our Travis CI configuration](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/.travis.yml#L43).

### Unit Tests

Please remember to add related unit tests.

- For C/C++ code, please follow [`google-test` Primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md).

- For Python code, please use [Python's standard `unittest` package](http://pythontesting.net/framework/unittest/unittest-introduction/).

### Writing Logs

We use [glog](https://github.com/google/glog) for logging in our C/C++ code.

For general information, please use `LOG`. For debug information, please use [`VLOG`](http://htmlpreview.github.io/?https://github.com/google/glog/blob/master/doc/glog.html#verbose). The reasoning is explained [here](https://groups.google.com/a/chromium.org/d/msg/chromium-dev/3NDNd1KzXeY/AZKMMx37fdQJ).

`VLOG` requires a *verbose level* parameter. For example:

```c++
VLOG(3) << "Operator FC is taking " << num_inputs << " inputs.";
```

When we run a PaddlePaddle application or test, we can specify a verbose threshold. For example:

```bash
GLOG_vmodule=buddy_allocator=2 \
GLOG_v=10 \
python \
../python/paddle/v2/framework/tests/test_recurrent_op.py
```

This enables the VLOG messages generated by `buddy_allocator.{h,cc}` at or below the given verbose level (the example message above is at level 3). Messages at lower verbose levels are displayed more often, so please output general information at lower levels. When coding C++, please follow this verbose-level convention:

- verbose level 1: [framework](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework)
- verbose level 3: [operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators)
- verbose level 5: [memory](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory), [platform](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/platform)
- verbose level 7: [math](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/math)

@ -0,0 +1,48 @@
# Benchmark

Machine:

- Server
  - Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
- Laptop
  - DELL XPS15-9560-R1745: i7-7700HQ, 8 GB RAM, 256 GB SSD
  - i5 MacBook Pro (Retina, 13-inch, Early 2015)
- Desktop
  - i7-6700k

System: CentOS release 6.3 (Final), Docker 1.12.1.

PaddlePaddle: paddlepaddle/paddle:latest (TODO: will rerun after 0.11.0)

- MKL-DNN tag v0.10
- MKLML 2018.0.20170720
- OpenBLAS v0.2.20

On each machine, we will test and compare the performance of training on a single node using MKL-DNN / MKLML / OpenBLAS respectively.

## Benchmark Model

### Server
Tested with batch sizes 64, 128, and 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.

Input image size: 3 * 224 * 224; metric: images/second.

- VGG-19

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| OpenBLAS  | 7.82  | 8.62  | 10.34 |
| MKLML     | 11.02 | 12.86 | 15.33 |
| MKL-DNN   | 27.69 | 28.8  | 29.27 |

chart on batch size 128
TBD

- ResNet
- GoogLeNet

### Laptop
TBD
### Desktop
TBD

@ -0,0 +1,67 @@
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

if(NOT WITH_GPU)
  return()
endif()

include(ExternalProject)

set(NCCL_SOURCE_DIR ${THIRD_PARTY_PATH}/nccl)

include_directories(${NCCL_SOURCE_DIR}/src/extern_nccl/src)

if(WITH_DSO)
  # If we use DSO, we do not build NCCL; we only download the source as a dependency.
  set(NCCL_BUILD_COMMAND "")
  set(NCCL_INSTALL_COMMAND "")
  set(NCCL_INSTALL_DIR "")
else()
  # Otherwise, we build NCCL and link it statically.
  set(NCCL_INSTALL_DIR ${THIRD_PARTY_PATH}/install/nccl)
  # Note: CUDA 8.0 is needed to build NCCL.
  # When CUDA is not installed in a system directory, set CUDA_HOME to your CUDA root.
  set(NCCL_BUILD_COMMAND "make -j 8")
  set(NCCL_INSTALL_COMMAND "make install PREFIX=${NCCL_INSTALL_DIR}")
endif()

ExternalProject_Add(
  extern_nccl
  ${EXTERNAL_PROJECT_LOG_ARGS}
  GIT_REPOSITORY    "https://github.com/NVIDIA/nccl.git"
  GIT_TAG           "v1.3.4-1"
  PREFIX            "${NCCL_SOURCE_DIR}"
  UPDATE_COMMAND    ""
  CONFIGURE_COMMAND ""
  BUILD_COMMAND     "${NCCL_BUILD_COMMAND}"
  INSTALL_COMMAND   "${NCCL_INSTALL_COMMAND}"
  INSTALL_DIR       "${NCCL_INSTALL_DIR}"
  TEST_COMMAND      ""
)

if(WITH_DSO)
  # We only need a placeholder "nccl" target that other targets can depend on.
  # CMake versions before 3.3 do not allow adding dependencies to INTERFACE
  # libraries, so fall back to a dummy static library in that case.
  if(${CMAKE_VERSION} VERSION_LESS "3.3.0")
    set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/lib_nccl_dummy.c)
    file(WRITE ${dummyfile} "const char * dummy_nccl = \"${dummyfile}\";")
    add_library(nccl STATIC ${dummyfile})
  else()
    add_library(nccl INTERFACE)
  endif()
else()
  # Link against the static library produced by the NCCL build above.
  add_library(nccl STATIC IMPORTED GLOBAL)
  set_property(TARGET nccl PROPERTY IMPORTED_LOCATION
               ${NCCL_INSTALL_DIR}/lib/libnccl_static.a)
endif()

add_dependencies(nccl extern_nccl)

@ -0,0 +1,60 @@
# Design Doc: float16

## Why float16
Half precision (float16) is a binary floating-point format that occupies 16 bits in memory. float16 is half the size of the traditional 32-bit single-precision format (float) and has lower precision and a smaller range.

When high-precision computation is not required, using the float16 data type could potentially

- reduce storage space, memory bandwidth, and power usage;
- increase the chance of data fitting into a smaller cache with lower latency;
- provide an arithmetic speed-up if supported by hardware.

## Survey of current float16 support
A brief survey of float16 support on different compilers, hardware, and libraries can be found below. Interested readers can refer to [link1](https://github.com/PaddlePaddle/Paddle/issues/4853) and [link2](https://github.com/Xreki/Xreki.github.io/blob/master/multi_data_types_in_dl_framework/ppt/float16_and_quantized_type.md) for more info.

The goal of the float16 class is to serve as a key for the executor to find and run the correct version of a compute method specialized for float16 in an operator kernel. It should be compatible with the various natively supported float16 implementations, including `__half` for CUDA, `float16_t` for ARM, and `Eigen::half` for Eigen, to make writing customized float16 kernels easier.

### Compiler
- nvcc supports the `__half` data type since CUDA 7.5.
- `__fp16` or `float16_t` is supported as a storage type for gcc >= 6.1 and clang >= 3.4.
- `__fp16` or `float16_t` is supported as an arithmetic type for gcc >= 7.1 and clang >= 3.9.

### Hardware
- `__half` is supported on GPUs with compute capability >= 5.3.
- `__fp16` is supported as a storage type for ARMv7-A, ARMv8-A, and above.
- `__fp16` is supported as an arithmetic type starting with ARMv8.2-A (currently, the only microarchitecture implementing ARMv8.2-A is ARM Cortex-A75, which was announced in May 2017. There seem to be no application processors currently available on the market that adopt this architecture. It is reported that Qualcomm Snapdragon 845 uses the Cortex-A75 design and will be available in mobile devices in early 2018).

### Libraries
- [Eigen](https://github.com/RLovelett/eigen) >= 3.3 supports float16 calculation on both GPU and CPU using the `Eigen::half` class. It is mostly useful for Nvidia GPUs because of the overloaded arithmetic operators that use CUDA intrinsics. On CPU it falls back to software emulation for calculation, and there is no special treatment for ARM processors.
- The [ARM compute library](https://github.com/ARM-software/ComputeLibrary) >= 17.02.01 supports NEON FP16 kernels (requires an ARMv8.2-A CPU).

## Implementation
The float16 class holds its data internally as a 16-bit `uint16_t`.

```cpp
struct float16 {
  uint16_t x;
};
```

float16 supports the following features:
- constructors / assignment operators that take input from primitive data types including bool, integers of various lengths, float, and double.
- constructors / assignment operators that take input from `__half` on CUDA, `float16_t` on ARM, and `Eigen::half` on Eigen.
- conversion operators to primitive data types and the half-precision data types on CUDA, ARM, and Eigen.
- overloaded arithmetic operators for CUDA, ARM, and non-ARM CPUs, respectively. These operators take advantage of the CUDA and ARM intrinsics on the corresponding hardware.

To support the above features, two fundamental conversion functions are provided:

```cpp
float16 float_to_half_rn(float f);  // convert to half precision in round-to-nearest-even mode
float half_to_float(float16 h);
```

These provide one-to-one conversion between float32 and float16. The two functions use different conversion routines depending on the current hardware: CUDA/ARM intrinsics are used when the corresponding hardware is available, and if the hardware or compiler does not support float32-to-float16 conversion, software emulation is performed instead.

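As an illustration of the software-emulation path, here is a minimal, self-contained sketch (not PaddlePaddle's actual code) of decoding an IEEE 754 binary16 bit pattern into a float; the float-to-half direction follows the same structure but needs extra care for round-to-nearest-even:

```cpp
#include <cstdint>
#include <cstring>

// Minimal sketch of software-emulated half -> float conversion.
// Illustrative only; shown to demonstrate the emulation fallback.
float half_to_float_emulated(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1fu;  // 5-bit exponent
  uint32_t mant = h & 0x3ffu;        // 10-bit mantissa
  uint32_t bits;
  if (exp == 0) {
    if (mant == 0) {
      bits = sign;  // signed zero
    } else {
      // Subnormal half: normalize the mantissa into float32's format.
      uint32_t e = 127 - 15 + 1;
      while ((mant & 0x400u) == 0) { mant <<= 1; --e; }
      bits = sign | (e << 23) | ((mant & 0x3ffu) << 13);
    }
  } else if (exp == 0x1fu) {
    bits = sign | 0x7f800000u | (mant << 13);  // Inf or NaN
  } else {
    bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);  // normal number
  }
  float out;
  std::memcpy(&out, &bits, sizeof(out));  // reinterpret the bit pattern
  return out;
}
```
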
## To do
After the float16 class is available, some of the future work items are:

- Update `pybind/tensor_py.h` to bind the C++ float16 type with numpy float16.

- Modify the `IndicateDataType()` method in `framework/operator.h` to make it compatible with float16.

- Create a type-casting operator that can convert the data in a tensor between float16 and other types.

@ -0,0 +1,232 @@
## Survey on Graph

Neural network frameworks often provide a symbolic API for users to write the network topology conveniently. This doc mainly focuses on the symbolic APIs of the most popular neural network frameworks, and tries to find out how to parse a symbolic configuration into a portable file, such as protobuf or JSON.

### MXNet

The core concept of the symbolic API is `Symbol`. MXNet implements the `Symbol` class in C++ and exports it to Python through the C API. Please refer to the comments in MXNet:

> `Symbol` is a help class used to represent the operator node in Graph.
> `Symbol` acts as an interface for building graphs from different components like Variable, Functor and Group. `Symbol` is also exported to python front-end (while Graph is not) to enable quick test and deployment. Conceptually, symbol is the final operation of a graph and thus including all the information required (the graph) to evaluate its output value.

A simple network topology written with Symbol is as follows:

```python
import mxnet as mx

def get_symbol(num_classes=10, **kwargs):
    data = mx.symbol.Variable('data')
    data = mx.symbol.Flatten(data=data)
    fc1 = mx.symbol.FullyConnected(data=data, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data=fc1, name='relu1', act_type="relu")
    fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64)
    act2 = mx.symbol.Activation(data=fc2, name='relu2', act_type="relu")
    fc3 = mx.symbol.FullyConnected(data=act2, name='fc3', num_hidden=num_classes)
    mlp = mx.symbol.SoftmaxOutput(data=fc3, name='softmax')
    return mlp
```

Variable here is actually a Symbol. Every basic Symbol corresponds to one Node, and every Node has its own NodeAttr. The NodeAttr class has an op field; when a Symbol represents a Variable (often input data), the op field is null.

Symbol contains a data member, `std::vector<NodeEntry> outputs`, and NodeEntry contains a pointer to a Node. We can follow the Node pointers to recover the whole graph.

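For reference, a simplified C++ sketch of these data structures, paraphrased from the NNVM headers that MXNet builds on (operator and attribute fields are omitted or abbreviated, so names are indicative only):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Simplified sketch of NNVM-style graph nodes (details omitted).
struct Node;
using NodePtr = std::shared_ptr<Node>;

// An entry refers to one specific output of a Node.
struct NodeEntry {
  NodePtr node;      // the node being referred to
  uint32_t index;    // which output of that node
  uint32_t version;  // version, used for mutable variables
};

struct Node {
  std::string name;               // e.g. "fc1"
  std::vector<NodeEntry> inputs;  // edges to the producing nodes
  // For a Variable node, the operator (op) field is null.
};

// A Symbol is just a set of output entries; walking the NodeEntry -> Node
// pointers from these outputs reaches the whole graph.
class Symbol {
 public:
  std::vector<NodeEntry> outputs;
};
```
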
A Symbol can also be saved to a JSON file.

Here is a detailed example:

```
>>> import mxnet as mx
>>> data = mx.symbol.Variable('data')
>>> print data.debug_str()
Variable:data

>>> data = mx.symbol.Flatten(data=data)
>>> print data.debug_str()
Symbol Outputs:
output[0]=flatten0(0)
Variable:data
--------------------
Op:Flatten, Name=flatten0
Inputs:
arg[0]=data(0) version=0

>>> fc1 = mx.symbol.FullyConnected(data = data, name='fc1', num_hidden=128)
>>> print fc1.debug_str()
Symbol Outputs:
output[0]=fc1(0)
Variable:data
--------------------
Op:Flatten, Name=flatten0
Inputs:
arg[0]=data(0) version=0
Variable:fc1_weight
Variable:fc1_bias
--------------------
Op:FullyConnected, Name=fc1
Inputs:
arg[0]=flatten0(0)
arg[1]=fc1_weight(0) version=0
arg[2]=fc1_bias(0) version=0
Attrs:
num_hidden=128

```

### TensorFlow

The core concept of the symbolic API is `Tensor`. TensorFlow defines `Tensor` in Python. Please refer to the comments in TensorFlow:

> A `Tensor` is a symbolic handle to one of the outputs of an `Operation`. It does not hold the values of that operation's output, but instead provides a means of computing those values in a TensorFlow [Session](https://www.tensorflow.org/api_docs/python/tf/Session).

A simple example is as follows:

```python
import tensorflow as tf

# Build a dataflow graph.
c = tf.constant([[1.0, 2.0], [3.0, 4.0]])
d = tf.constant([[1.0, 1.0], [0.0, 1.0]])
e = tf.matmul(c, d)

# Construct a `Session` to execute the graph.
sess = tf.Session()

# Execute the graph and store the value that `e` represents in `result`.
result = sess.run(e)
```

The main methods of `Tensor` are as follows:

```python
@property
def op(self):
    """The `Operation` that produces this tensor as an output."""
    return self._op

@property
def dtype(self):
    """The `DType` of elements in this tensor."""
    return self._dtype

@property
def graph(self):
    """The `Graph` that contains this tensor."""
    return self._op.graph

@property
def name(self):
    """The string name of this tensor."""
    if not self._op.name:
        raise ValueError("Operation was not named: %s" % self._op)
    return "%s:%d" % (self._op.name, self._value_index)

@property
def device(self):
    """The name of the device on which this tensor will be produced, or None."""
    return self._op.device
```

A Tensor can be used as a target to run in a session. A Tensor contains all the information of the graph and tracks data dependencies.

Here is a detailed example:

```
>>> import tensorflow as tf
>>> c = tf.constant([[1.0, 2.0], [3.0, 4.0]])
>>> print c.graph
<tensorflow.python.framework.ops.Graph object at 0x10f256d50>
>>> d = tf.constant([[1.0, 1.0], [0.0, 1.0]])
>>> print d.graph
<tensorflow.python.framework.ops.Graph object at 0x10f256d50>
>>> e = tf.matmul(c, d)
>>> print e.graph
<tensorflow.python.framework.ops.Graph object at 0x10f256d50>
```

### DyNet

The core concept of the symbolic API is `Expression`; DyNet defines the `Expression` class in C++.

A simple example is as follows:

```cpp
ComputationGraph cg;
Expression W = parameter(cg, pW);

Expression in = input(cg, xs[i]);
Expression label = input(cg, ys[i]);
Expression pred = W * in;
Expression loss = square(pred - label);
```

The input data and parameters are also represented by Expressions. Every basic Expression corresponds to a Node, and the input data is also a Node.

Expression has a data member ComputationGraph, which is modified as the user configures the network. An Expression can be a running target, because an Expression contains all of its dependencies.

Here is a detailed example:

write the topology in C++

```cpp
ComputationGraph cg;
Expression W = parameter(cg, pW);
cg.print_graphviz();

Expression pred = W * xs[i];
cg.print_graphviz();

Expression loss = square(pred - ys[i]);
cg.print_graphviz();
```

compile and print

```
# first print
digraph G {
  rankdir=LR;
  nodesep=.05;
  N0 [label="v0 = parameters({1}) @ 0x7ffe4de00110"];
}
# second print
digraph G {
  rankdir=LR;
  nodesep=.05;
  N0 [label="v0 = parameters({1}) @ 0x7ffe4de00110"];
  N1 [label="v1 = v0 * -0.98"];
  N0 -> N1;
}
# third print
digraph G {
  rankdir=LR;
  nodesep=.05;
  N0 [label="v0 = parameters({1}) @ 0x7ffe4de00110"];
  N1 [label="v1 = v0 * -0.98"];
  N0 -> N1;
  N2 [label="v2 = -1.88387 - v1"];
  N1 -> N2;
  N3 [label="v3 = -v2"];
  N2 -> N3;
  N4 [label="v4 = square(v3)"];
  N3 -> N4;
}
```

### Conclusion

Symbol/Tensor/Expression in MXNet/TensorFlow/DyNet are concepts at the same level. We use the unified name Expression here; this level of concept has the following features:

- Users write the topology with a symbolic API, and every return value is an Expression, including the input data and the parameters.
- An Expression corresponds to a global graph, and Expressions can be composed.
- An Expression tracks all of its dependencies and can be used as a run target.

@ -0,0 +1,36 @@
# Design Doc: Model Format

## Motivation

A model is an output of the training process. One complete model consists of two parts, the **topology** and the **parameters**. In order to support industrial deployment, the model format must be self-contained and must not expose any training source code.

As a result, in PaddlePaddle, the **topology** is represented as a [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/doc/design/program.md), which describes the model structure. The **parameters** contain all the trainable weights in the model. We must support large parameters and efficient serialization/deserialization of parameters.

## Implementation

The topology is saved as plain text in a detailed, self-contained protobuf file.

The parameters are saved as a binary file. As we all know, a protobuf message has a size limit of [64M](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream#CodedInputStream.SetTotalBytesLimit.details). We have done a [benchmark experiment](https://github.com/PaddlePaddle/Paddle/pull/4610), which shows that protobuf is not a good fit for this task.

As a result, we design a particular format for tensor serialization. By default, an arbitrary tensor in Paddle is a [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) and has a description proto, [LoDTensorDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L99). We save the DescProto as the byte string header. It contains all the necessary information, such as the `dims` and the `LoD` information of the [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). A tensor stores its values in a contiguous memory buffer. For speed we dump the raw memory to disk and save it as the byte string content. The binary format of one tensor is therefore as follows.

The table below shows a tensor's byte view in detail. Note that all the signed values are written in the little-endian format.

| field name         | type      | description |
| ------------------ | --------- | ----------- |
| version            | uint32_t  | Version of the saved file. Always 0 now. |
| tensor desc length | uint32_t  | TensorDesc (protobuf message) length in bytes. |
| tensor desc        | void*     | TensorDesc protobuf binary message |
| tensor data        | void*     | Tensor's data in binary format. The length of `tensor_data` is decided by `TensorDesc.dims()` and `TensorDesc.data_type()` |
| lod_level          | uint64_t  | Level of LoD |
| length of lod[0]   | uint64_t  | [Optional] length of lod[0] in bytes. |
| data of lod[0]     | uint64_t* | [Optional] lod[0].data() |
| ...                | ...       | ... |

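To make the layout concrete, here is a small, hypothetical C++ sketch that writes one tensor in the byte order described above (little-endian assumed, as on x86). The function and parameter names are illustrative only and are not the actual PaddlePaddle serialization code:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of the byte layout above; names are illustrative.
void SaveTensor(std::ofstream& os,
                const std::string& desc_proto_bytes,    // serialized TensorDesc
                const void* data, uint64_t data_bytes,  // raw tensor buffer
                const std::vector<std::vector<uint64_t>>& lod) {
  const uint32_t version = 0;  // always 0 for now
  const uint32_t desc_len = static_cast<uint32_t>(desc_proto_bytes.size());
  os.write(reinterpret_cast<const char*>(&version), sizeof(version));
  os.write(reinterpret_cast<const char*>(&desc_len), sizeof(desc_len));
  os.write(desc_proto_bytes.data(), desc_len);                // tensor desc
  os.write(reinterpret_cast<const char*>(data), data_bytes);  // tensor data
  const uint64_t lod_level = lod.size();
  os.write(reinterpret_cast<const char*>(&lod_level), sizeof(lod_level));
  for (const auto& level : lod) {  // [optional] each LoD level: length, then data
    const uint64_t bytes = level.size() * sizeof(uint64_t);
    os.write(reinterpret_cast<const char*>(&bytes), sizeof(bytes));
    os.write(reinterpret_cast<const char*>(level.data()), bytes);
  }
}
```
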
## Summary

- We introduce a model format.
- The model, represented by its forward-pass computation procedure, is saved in a **ProgramDesc** protobuf message.
- The **parameters** are saved as a set of binary tensors in the format specified above.

@ -0,0 +1,72 @@
# Averaging Parameter in PaddlePaddle

## Why Averaging
In a large-scale machine learning setup where the size of the training data is huge, it could take a large number of iterations over the training data before we achieve the optimal parameter values of our model. Given this problem setup, it is desirable to obtain the optimal values of the parameters by going through the data in as few passes as possible.

Polyak and Juditsky (1992) showed that the test performance of a simple average of the parameters obtained by Stochastic Gradient Descent (SGD) is as good as that of the parameter values obtained by training the model over and over again on the training dataset.

Hence, to accelerate Stochastic Gradient Descent, Averaged Stochastic Gradient Descent (ASGD) was proposed in Polyak and Juditsky (1992). In ASGD, the running average of the parameters obtained by SGD is used as the estimator for <img src="./images/theta_star.gif"/><br/>. The averaging is done as follows:

<img src="./images/asgd.gif" align="center"/><br/>

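In symbols, ASGD maintains the running (Polyak) average of the SGD iterates $\theta_t$, which is what the formula in the image above expresses (up to the choice of starting iterate):

$$\bar{\theta}_T = \frac{1}{T} \sum_{t=1}^{T} \theta_t$$
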
We propose averaging for any optimizer, similar to how ASGD performs it, as mentioned above.

### How to perform Parameter Averaging in PaddlePaddle

Parameter Averaging in PaddlePaddle works in the following way during training:
1. It takes in an instance of a normal optimizer as an input, e.g. RMSPropOptimizer.
2. The optimizer itself is responsible for updating the parameters.
3. The ParameterAverageOptimizer maintains a separate copy of the parameters for itself:
   1. Conceptually, the values of this copy are the average of the values of the parameters in the most recent N batches.
   2. However, saving all N instances of the parameters in memory is not feasible.
   3. Therefore, an approximation algorithm is used.

Hence, overall we have two copies of the parameters: one for the optimizer itself, and one for the ParameterAverageOptimizer. The former should be used in back propagation, while the latter should be used during testing and should be saved.

During the testing / model-saving phase, we perform the following steps:
1. Perform the delayed operations.
2. Save the current values of the parameters to a temporary variable.
3. Replace the values of the parameters with the averaged values.
4. Perform testing and/or save the parameters.
5. Restore the values of the parameters once done.

### How to implement Averaging of Parameter in PaddlePaddle

We can add the ParameterAverageOptimizer op to the graph through the Python API. Using this approach, we manually add this op to the graph and direct the output of the optimizer op to this op during training.

**Advantages**:
- Allows for greater flexibility to the users of PaddlePaddle. Using this approach, the users can plug different optimizers into ParameterAverageOptimizer by passing in the optimizer to the op.
- Makes it easy for the users to customize and extend the framework.

**Disadvantages**:
- Implementation requires re-writing the averaging methodology in Python.

### Low-Level implementation

In the new design, we propose to create a new operation for averaging parameter updates (ParameterAverageOptimizer). For now, we can add an op that takes the following as input:
- the optimizer
- the window_size to keep the updates

The ParameterAverageOptimizer op can be like any other operator with its own CPU/GPU implementation, either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement the kernel using Eigen, following the abstraction pattern implemented for [Operators](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.h). We also want to support the case when the Trainer/Optimizer runs on the GPU while ParameterAverageOptimizer runs on a CPU.

The idea of building an op for averaging is in sync with the refactored PaddlePaddle philosophy of using operators to represent any computation unit. How the op is added to the computation graph will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in the Python API.

### Python API implementation for ParameterAverageOptimizer

Based on Polyak and Juditsky (1992), we can generalize the averaging of updates to any optimizer. The input to the op would be the following:
- Any optimizer (RMSProp, AdaGrad, etc.)
- A window size. The op keeps accumulating updated parameter values over a window of N batches and takes an average. Move the averaged value to a buffer when the window is full to avoid loss of precision (a sketch of this scheme follows below).

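The following is a rough, framework-independent C++ sketch of this accumulation scheme; it is illustrative only (the real ParameterAverageOptimizer kernel would operate on tensors and may differ in detail). It keeps per-window partial sums and folds each completed window into a higher-precision buffer, so the reported average never sums an unbounded number of values directly:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of window-based parameter averaging.
class WindowAverager {
 public:
  WindowAverager(size_t dim, size_t window)
      : window_(window), sum_(dim, 0.0), buffer_(dim, 0.0) {}

  // Called after every optimizer update with the current parameter values.
  void Accumulate(const std::vector<float>& param) {
    for (size_t i = 0; i < param.size(); ++i) sum_[i] += param[i];
    if (++count_in_window_ == window_) {           // window full:
      for (size_t i = 0; i < sum_.size(); ++i) {   // fold mean into the buffer
        buffer_[i] += sum_[i] / static_cast<double>(window_);
        sum_[i] = 0.0;
      }
      ++windows_;
      count_in_window_ = 0;
    }
  }

  // Averaged parameters, used during testing / model saving.
  std::vector<float> Average() const {
    std::vector<float> avg(buffer_.size(), 0.0f);
    const double total = static_cast<double>(windows_ * window_ + count_in_window_);
    if (total == 0.0) return avg;
    for (size_t i = 0; i < avg.size(); ++i) {
      avg[i] = static_cast<float>((buffer_[i] * window_ + sum_[i]) / total);
    }
    return avg;
  }

 private:
  size_t window_;
  size_t count_in_window_ = 0;
  size_t windows_ = 0;           // number of completed windows
  std::vector<double> sum_;      // partial sums within the current window
  std::vector<double> buffer_;   // accumulated per-window averages
};
```
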
Using the ParameterAverageOptimizer op, any user can add the operation to their computation graph. However, this will require a lot of lines of code, so we should design Python APIs that support averaging. As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters, and variables. Since ParameterAverageOptimizer will be an operator, it makes sense to create it in the layer functions.
We will have a wrapper written in Python that supports the functionality and implements the actual core computation in the C++ core, as we have done for other [Optimizers](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.cc).

#### Creation of the ParameterAverageOptimizer operator
There are two ways of creating the ParameterAverageOptimizer op:
1. We create the op immediately while building the computation graph.
2. We add the op in a lazy manner, just before the backward pass, similar to the way the optimization ops are added.

The proposal is to add the op immediately while building the computation graph.

#### High-level API

In the PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we also need to provide parameter averaging functionality in the layer functions.