# Design Doc: PaddlePaddle Fluid

## Why Fluid
When Baidu developed PaddlePaddle in 2013, the only well-known open source deep learning system at the time was Caffe. However, when PaddlePaddle was open-sourced in 2016, many other choices were available. This raised a question -- why open source yet another deep learning framework?

Fluid is the answer. Fluid is similar to PyTorch and TensorFlow Eager Execution in that it describes the "process" of training or inference rather than a model. In fact, in PyTorch, TensorFlow Eager Execution, and Fluid, there is no concept of a model at all. The details are covered in the sections below. Fluid currently takes this idea further than PyTorch and Eager Execution, and we are trying to push Fluid towards the direction of a compiler and a new programming language for deep learning.
## The Evolution of Deep Learning Systems

Deep learning infrastructure is one of the fastest-evolving technologies. Within four years, three generations of technologies have already been invented.
| Existed since | model as sequence of layers | model as graph of operators | No model |
|--|--|--|--|
| 2013 | Caffe, Theano, Torch, PaddlePaddle | | |
| 2015 | | TensorFlow, MXNet, Caffe2, ONNX, n-graph | |
| 2016 | | | PyTorch, TensorFlow Eager Execution, PaddlePaddle Fluid |

From the above table, we see that deep learning technology is evolving towards getting rid of the concept of a model. To understand the reasons behind this direction, it helps to compare the *programming paradigms*, i.e., the ways to program deep learning applications using these systems. The following section goes over these paradigms.
## Deep Learning Programming Paradigms

With the systems listed as the first or second generation, e.g., Caffe or TensorFlow, an AI application training program looks like the following:
```python
x = layer.data("image")
l = layer.data("label")
f = layer.fc(x, W)
s = layer.softmax(f)
c = layer.mse(l, s)

for i in xrange(1000): # train for 1000 iterations
    m = read_minibatch()
    forward({input=x, data=m}, minimize=c)
    backward(...)

print W # print the trained model parameters.
```
The above program includes two parts:

1. The first part describes the model, and
2. The second part describes the training process (or inference process) for the model.
This paradigm has a well-known problem that limits programmer productivity. If the programmer makes a mistake in configuring the model, the error message won't show up until the second part is executed and the `forward` and `backward` propagations are performed. This makes it difficult for the programmer to debug and locate a mistake that lies blocks away from the actual error prompt.

This difficulty of debugging and iterating quickly on a program is the primary reason that programmers, in general, prefer PyTorch over the older systems. Using PyTorch, we would write the above program as follows:
```python
W = tensor(...)

for i in xrange(1000): # train for 1000 iterations
    m = read_minibatch()
    x = m["image"]
    l = m["label"]
    f = layer.fc(x, W)
    s = layer.softmax(f)
    c = layer.mse(l, s)
    backward()

print W # print the trained model parameters.
```
We can see that the main difference is moving the model configuration part (the first step) into the training loop. This change allows mistakes in the model configuration to be reported where they actually appear in the programming block. It also represents the model, or its forward pass, more naturally by keeping the configuration process inside the training loop.

## Describe Arbitrary Models for the Future
Describing the process instead of the model also gives Fluid the flexibility to define non-standard models that haven't been invented yet.

As we write out the program for the process, we can write an RNN as a loop, instead of as a layer or as an operator. A PyTorch example would look like the following:
```python
for i in xrange(1000):
    m = read_minibatch()
    x = m["sentence"]
    for t in xrange(x.len()):
        h[t] = the_step(x[t])
```
With Fluid, the training loop and the RNN in the above program are not really Python loops, but just a "loop structure" provided by Fluid and implemented in C++ as the following:
```python
train_loop = layers.While(cond)
with train_loop.block():
    m = read_minibatch()
    x = m["sentence"]
    rnn = layers.While(...)
    with rnn.block():
        h[t] = the_step(x[t])
```
An actual Fluid example is described [here](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/python/paddle/v2/fluid/tests/test_while_op.py#L36-L44).

From the example, the Fluid programs look very similar to their PyTorch equivalents, except that Fluid's loop structure, wrapped in Python's `with` statement, can run much faster than a plain Python loop.

We have more examples, such as the [`if-then-else`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/if_else_op.md) structure of Fluid.
## Turing Completeness

In computability theory, a system of data-manipulation rules, such as a programming language, is said to be Turing complete if it can be used to simulate any Turing machine. A programming language is Turing complete if it provides if-then-else and loops. From the above examples, Fluid seems to be Turing complete; however, it is worth noting that there is a slight difference between the `if-then-else` of Fluid and that of a programming language: the former runs both of its branches and splits the input mini-batch into two -- one part for the True condition and another for the False condition. It has not been researched in depth whether this is equivalent to the `if-then-else` of programming languages that makes them Turing complete. Based on a conversation with [Yuan Yu](https://research.google.com/pubs/104812.html), it seems to be the case, but this needs to be looked into in depth.
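
To make the mini-batch splitting concrete, here is a minimal NumPy sketch of the semantics described above. It is a conceptual illustration only, not Fluid's actual `IfElse` API: both branches run, each on its own sub-batch, and the results are merged back into the original order.

```python
import numpy as np

def batched_if_else(cond, batch, true_fn, false_fn):
    """Run both branches on the split mini-batch and merge the results back.

    The True sub-batch goes through true_fn, the False sub-batch through
    false_fn, and the outputs are scattered back into the original order.
    """
    true_idx = np.flatnonzero(cond)
    false_idx = np.flatnonzero(~cond)
    out = np.empty_like(batch)
    out[true_idx] = true_fn(batch[true_idx])
    out[false_idx] = false_fn(batch[false_idx])
    return out

# Per-example reference semantics: if x >= 0 then 2 * x else -x.
x = np.array([1.0, -2.0, 3.0, -4.0])
print(batched_if_else(x >= 0, x, lambda v: 2 * v, lambda v: -v))  # [2. 2. 6. 4.]
```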
## The Execution of a Fluid Program

There are two ways to execute a Fluid program: interpretation and compilation. When a program is executed, it creates a protobuf message [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/paddle/framework/framework.proto#L145) that describes the process and is conceptually like an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.
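
To make the interpreter analogy concrete, below is a minimal, purely illustrative Python sketch -- not the real C++ `Executor`, and not the real `ProgramDesc` schema -- that walks a list of operator descriptions and executes each one against a scope of named variables:

```python
import numpy as np

# A toy "ProgramDesc": an ordered list of operator descriptions. The real
# ProgramDesc is a protobuf message containing blocks of operator descriptions.
toy_program = [
    {"type": "mul", "inputs": ["X", "W"], "outputs": ["XW"]},
    {"type": "add", "inputs": ["XW", "b"], "outputs": ["Out"]},
]

# Toy operator kernels, keyed by operator type.
kernels = {
    "mul": lambda a, b: a.dot(b),
    "add": lambda a, b: a + b,
}

def run(program, scope):
    """Interpret the program: look up each op's kernel and apply it to the scope."""
    for op in program:
        args = [scope[name] for name in op["inputs"]]
        scope[op["outputs"][0]] = kernels[op["type"]](*args)
    return scope

scope = {"X": np.ones((2, 3)), "W": np.ones((3, 4)), "b": np.zeros(4)}
print(run(toy_program, scope)["Out"])  # a 2x4 matrix filled with 3.0
```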
Fluid is moving towards the direction of a compiler, which is explained in more detail later in this article.
## Backward Compatibility of Fluid

Despite the advantages of removing the concept of a *model*, hardware manufacturers might still prefer that the concept exists, because it makes it easier for them to support multiple frameworks at once and to run a trained model during inference. For example, Nervana, a startup company acquired by Intel, has been working on an XPU that reads models in the format known as [n-graph](https://github.com/NervanaSystems/ngraph). Similarly, [Movidius](https://www.movidius.com/) is producing a mobile deep learning chip that reads and runs graphs of operators. The well-known [ONNX](https://github.com/onnx/onnx) is also a file format for graphs of operators.
For Fluid, we can write a converter that extracts the parts in the `ProgramDesc` protobuf message, converts them into a graph of operators, and exports the graph into the ONNX or n-graph format.
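
The sketch below illustrates the conversion step on a toy in-memory program representation (not the real `ProgramDesc` protobuf); serializing the resulting graph to ONNX or n-graph would then use those projects' own builder APIs:

```python
def program_to_op_graph(program):
    """Flatten a toy program (a list of op descriptions) into an operator graph.

    Returns the list of operator nodes plus the data-dependency edges between
    them, which is the information an ONNX or n-graph exporter needs.
    """
    producer_of = {}  # variable name -> index of the op that produced it
    edges = []
    for i, op in enumerate(program):
        for var in op["inputs"]:
            if var in producer_of:
                edges.append((producer_of[var], i))
        for var in op["outputs"]:
            producer_of[var] = i
    return program, edges

toy_program = [
    {"type": "mul", "inputs": ["X", "W"], "outputs": ["XW"]},
    {"type": "add", "inputs": ["XW", "b"], "outputs": ["Out"]},
]
_, edges = program_to_op_graph(toy_program)
print(edges)  # [(0, 1)] -- the "add" op consumes the output of the "mul" op
```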
## Towards a Deep Learning Language and the Compiler

We can change the `if-then-else` and loop structures a little bit in the above Fluid example programs, to make Fluid a new programming language, different from Python.

Even if we do not invent a new language, as long as we get the `ProgramDesc` message filled in, we can write a transpiler that translates each invocation of an operator into a C++ call to that operator's kernel function. For example, a transpiler that weaves in CUDA kernels outputs an NVIDIA-friendly C++ program, which can be built using `nvcc`. Another transpiler could generate MKL-friendly code that should be built using `icc` from Intel. More interestingly, we can translate a Fluid program into a distributed version of two `ProgramDesc` messages, one to run on the trainer process and the other on the parameter server. For more details on the last example, the [concurrent programming design](concurrent_programming.md) document is a good pointer. The following figure explains the proposed two-stage process:
![](fluid-compiler.png)
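
As a rough illustration of the first kind of transpiler, the sketch below turns each operator invocation in a toy program into a C++ kernel call. The program representation and the kernel names (`MulKernel`, `AddKernel`) are made up for illustration and are not Paddle's real kernels:

```python
CPP_KERNELS = {"mul": "MulKernel", "add": "AddKernel"}  # hypothetical names

def transpile_to_cpp(program):
    """Emit a C++ main() whose body is one kernel call per operator."""
    lines = ["int main() {"]
    for op in program:
        args = ", ".join(op["inputs"] + op["outputs"])
        lines.append("  {}({});".format(CPP_KERNELS[op["type"]], args))
    lines += ["  return 0;", "}"]
    return "\n".join(lines)

toy_program = [
    {"type": "mul", "inputs": ["X", "W"], "outputs": ["XW"]},
    {"type": "add", "inputs": ["XW", "b"], "outputs": ["Out"]},
]
print(transpile_to_cpp(toy_program))
```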
# Design Doc: NCCL support in Paddle Fluid

## Abstract
This design doc describes NCCL support in Paddle. We propose an approach to support the NCCL library on both a single machine and multiple machines. We wrap the NCCL primitives `Broadcast`, `Allreduce`, and `Reduce` as operators so that multi-GPU power can be utilized from one script.
## Motivation

[NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library for multi-GPU communication, optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that achieve high bandwidth over PCIe and the NVLink high-speed interconnect. With the NCCL library, we can easily accelerate training in parallel.
- Pros
  1. Easy to plug in with the [NCCL2](https://developer.nvidia.com/nccl) library.
  1. High performance on NVIDIA GPUs.
  1. MPI-like primitives, which have a low learning cost for users.

- Cons
  1. Designed only for NVIDIA GPUs; not a general multi-device solution.
  1. Although NCCL1 is open-sourced under the BSD license, NCCL2 is no longer open source.
At the beginning of training, the framework needs to distribute the same parameters to every GPU, and to merge the gradients whenever the user requires.

As a result, during training we need peer-to-peer copies between different GPUs, aggregation of gradients/parameters from the GPUs, and broadcasting of parameters to the GPUs. Every GPU only needs to run the operator with the correct place information.

Besides, we need interfaces to synchronize model updates across the different GPU cards.
## Implementation

As mentioned above, we wrap the NCCL routines as several kinds of operators. Note that NCCL needs to create a communicator among the GPUs at the beginning, so an `NCCLInit` operator is created for that purpose.
### Transpiler

To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the transpiler compiles the user-defined operation graph into sub-graphs to be executed on different devices.
1. The user-defined model will be a single-device program.

2. Broadcast/Reduce operators between GPUs will be inserted into the program; in the multi-node case, `Send` and `Recv` operators may be inserted as well.
*Broadcast and AllReduce on a single machine; Broadcast, AllReduce, and [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) on multiple machines*

<img src="images/multigpu_before_convert.png" width="300"/>
After compiling, the graph is as shown below:

<img src="images/multigpu_allreduce.png" width="1000"/>
Operators are added to the sub-graphs, and every GPU is assigned a rank such as `rank0`, `rank1`, etc.

- **Broadcast**. The Broadcast operator distributes the initialized parameters to all the GPUs from the GPU that owns them, e.g., the `rank0` GPU.
- **AllReduce**. The AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce is implemented with the ring-based communication method, which avoids the bottleneck of a single GPU.
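
The following toy sketch shows what such an insertion pass might look like, using a made-up operator representation and made-up op names rather than Paddle's real pass: it prepends an init and broadcast step and appends an all-reduce after every gradient-producing operator.

```python
def insert_nccl_ops(program, params):
    """Toy graph-conversion pass: broadcast parameters first, all-reduce gradients."""
    converted = [{"type": "ncclInit", "outputs": ["Communicator"]}]
    for p in params:  # the owning rank sends the initialized parameters out
        converted.append({"type": "ncclBroadcast", "inputs": [p], "root": "rank0"})
    for op in program:
        converted.append(op)
        for out in op.get("outputs", []):
            if out.endswith("@GRAD"):  # sum this gradient over all GPUs
                converted.append({"type": "ncclAllReduce", "inputs": [out], "outputs": [out]})
    return converted

single_device_program = [
    {"type": "mul_grad", "inputs": ["X", "Out@GRAD"], "outputs": ["W@GRAD"]},
    {"type": "sgd", "inputs": ["W", "W@GRAD"], "outputs": ["W"]},
]
for op in insert_nccl_ops(single_device_program, params=["W"]):
    print(op["type"])
# ncclInit, ncclBroadcast, mul_grad, ncclAllReduce, sgd
```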
Note that the AllReduce operator forces the GPUs to synchronize at that point. Whether the whole training process is asynchronous or synchronous depends on where the AllReduce points are placed in the graph.

As shown in the picture, each GPU computes the gradient of `W`, followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; each GPU then runs the optimization process individually and applies the gradient to its own `W`.
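
The NumPy sketch below illustrates this behavior for a toy linear model: each "GPU" computes the gradient on its own shard, an all-reduce sums the gradients over the full batch, and every device applies the same update. It is conceptual pseudocode under those assumptions, not Paddle's actual operators.

```python
import numpy as np

np.random.seed(0)
n_gpus, lr = 4, 0.1
W = np.zeros(3)                        # every GPU starts from the same W
replicas = [W.copy() for _ in range(n_gpus)]

full_x = np.random.randn(8, 3)         # the full mini-batch
full_y = full_x @ np.array([1.0, -2.0, 0.5])
shards = zip(np.array_split(full_x, n_gpus), np.array_split(full_y, n_gpus))

# Each GPU computes dW on its shard of the mini-batch.
grads = []
for (x, y), w in zip(shards, replicas):
    err = x @ w - y
    grads.append(x.T @ err / len(full_x))

dW = sum(grads)                              # AllReduce: sum of per-GPU gradients
replicas = [w - lr * dW for w in replicas]   # every GPU applies the same update
print(replicas[0])                           # all replicas stay identical
```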
- **AllReduce**

  Note that our AllReduce operator is a ring-based AllReduce implementation. If we used the NCCL2 AllReduce primitive, every GPU would optimize over the full batch of data, wasting (n-1) GPUs' worth of compute resources. In addition, NCCL2's built-in AllReduce only utilizes the communication resources during synchronization, and updating the gradients becomes a subsequent phase. In fact, we can amortize the gradient update time into the communication phase. The process is:

  1. Every parameter has its root card. That card is responsible for aggregating the gradients from the other GPUs.
  2. The whole model's parameters are hashed to different root cards to ensure load balance between the GPUs.
  3. Logically neighboring cards send the parameter to the next one. After one round, the parameter's root card has aggregated the full gradient.
  4. Then the root card optimizes the parameter.
  5. The root card sends its optimized result to its neighbor, and each neighbor forwards the parameter to the next one.
  6. Finish the synchronization round.

  The total time cost is 2 * (n-1) * per-parameter-send-time, so we reach the goal of amortizing the update time into the communication phase.
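
To sanity-check the 2 * (n-1) figure, here is a small, purely illustrative Python simulation of the schedule above for a single parameter: the gradient travels around the ring once to be aggregated at the root card, and the optimized result travels around once more to reach every card.

```python
def simulate_ring(n_gpus, grads):
    """Simulate aggregate-then-distribute for one parameter.

    grads[i] is GPU i's local gradient for that parameter. Returns the value
    every card holds at the end and the number of sends used.
    """
    root = 0                      # the parameter's root card (chosen by hashing)
    sends = 0
    # Aggregation: each card adds its gradient and forwards the partial sum,
    # going around the ring so that the root receives the full sum last.
    partial = 0.0
    for step in range(1, n_gpus + 1):
        card = (root + step) % n_gpus        # root+1, root+2, ..., root
        partial += grads[card]
        if card != root:
            sends += 1                       # forward the partial sum
    optimized = -0.1 * partial               # root turns the full gradient into an update
    # Distribution: the optimized value travels around the ring back to everyone.
    holds = {root: optimized}
    card = root
    for _ in range(n_gpus - 1):
        card = (card + 1) % n_gpus
        holds[card] = optimized
        sends += 1
    return holds, sends

holds, sends = simulate_ring(4, [1.0, 2.0, 3.0, 4.0])
print(sends)   # 2 * (4 - 1) = 6 sends for this parameter
print(holds)   # every card ends up holding the same optimized value
```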
# Design Doc: Supporting new Device/Library

## Background
Deep learning has a high demand for computing resources. New high-performance devices and computing libraries appear very frequently. Deep learning frameworks have to integrate these high-performance devices and computing libraries in a flexible and efficient way.

On the one hand, hardware and computing libraries usually do not have a one-to-one correspondence. For example, Intel CPUs support the Eigen and MKL computing libraries, while NVIDIA GPUs support the Eigen and cuDNN computing libraries. We have to implement operator-specific kernels for each computing library.

On the other hand, users usually do not want to care about the low-level hardware and computing libraries when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are hardware independent.
So, supporting a new device/library in Fluid becomes a challenge.

## Basic: Integrate A New Device/Library
For a general overview of Fluid, please refer to the [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md).

There are mainly three parts that we have to consider while integrating a new device/library:
- Place and DeviceContext: indicate the device id and manage hardware resources
- Memory and Tensor: malloc/free data on a certain device
- Math Functor and OpKernel: implement computing units on certain devices/libraries
### Place and DeviceContext

#### Place

Fluid uses the class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55) to represent different devices and computing libraries. There are inheritance relationships between different kinds of `Place`.
```
        | CPUPlace  --> MKLDNNPlace
Place --| CUDAPlace --> CUDNNPlace
        | FPGAPlace
```
And `Place` is defined as follows:
```
typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
```
#### DeviceContext

Fluid uses the class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30) to manage the resources of different hardware, such as the CUDA stream in `CUDADeviceContext`. There are also inheritance relationships between different kinds of `DeviceContext`.
```
                    /-> CPUDeviceContext --> MKLDeviceContext
DeviceContext ----> CUDADeviceContext   --> CUDNNDeviceContext
                    \-> FPGADeviceContext
```
An example for an NVIDIA GPU is as follows:

- DeviceContext
```
class DeviceContext {
  virtual Place GetPlace() const = 0;
};
```
- CUDADeviceContext
```
class CUDADeviceContext : public DeviceContext {
  Place GetPlace() const override { return place_; }

 private:
  CUDAPlace place_;
  cudaStream_t stream_;
  cublasHandle_t cublas_handle_;
  std::unique_ptr<Eigen::GpuDevice> eigen_device_;  // binds with stream_
};
```
- CUDNNDeviceContext
```
class CUDNNDeviceContext : public CUDADeviceContext {
 private:
  cudnnHandle_t cudnn_handle_;
};
```
### Memory and Tensor

#### memory module
Fluid provides the following [memory interfaces](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36):
```
template <typename Place>
void* Alloc(Place place, size_t size);

template <typename Place>
void Free(Place place, void* ptr);

template <typename Place>
size_t Used(Place place);
```
To implement these interfaces, we have to implement a MemoryAllocator for each device.
#### Tensor

[Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36) holds data with some shape in a specific Place.
```cpp
class Tensor {
 public:
  /*! Return a pointer to mutable memory block. */
  template <typename T>
  inline T* data();

  /**
   * @brief  Return a pointer to mutable memory block.
   * @note   If not exist, then allocation.
   */
  template <typename T>
  inline T* mutable_data(platform::Place place);

  /**
   * @brief  Return a pointer to mutable memory block.
   *
   * @param[in] dims   The dimensions of the memory block.
   * @param[in] place  The place of the memory block.
   *
   * @note   If not exist, then allocation.
   */
  template <typename T>
  inline T* mutable_data(DDim dims, platform::Place place);

  /*! Resize the dimensions of the memory block. */
  inline Tensor& Resize(const DDim& dims);

  /*! Return the dimensions of the memory block. */
  inline const DDim& dims() const;

 private:
  /*! holds the memory block if allocated. */
  std::shared_ptr<Placeholder> holder_;

  /*! points to dimensions of memory block. */
  DDim dim_;
};
```
`Placeholder` is used to delay memory allocation; that is, we can first define a tensor, use `Resize` to configure its shape, and then call `mutable_data` to allocate the actual memory.
```cpp
paddle::framework::Tensor t;
paddle::platform::CPUPlace place;
// set size first
t.Resize({2, 3});
// allocate memory on CPU later
t.mutable_data<float>(place);
```
### Math Functor and OpKernel

Fluid implements computing units based on different DeviceContexts. Some computing units are shared between operators. This common part is put in the `operators/math` directory as basic Functors.

Let's take [MaxOutFunctor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27) as an example:

The interface is defined in the header file:
```
template <typename DeviceContext, typename T>
class MaxOutFunctor {
 public:
  void operator()(const DeviceContext& context, const framework::Tensor& input,
                  framework::Tensor* output, int groups);
};
```

The CPU implementation is in the .cc file:
```
template <typename T>
class MaxOutFunctor<platform::CPUDeviceContext, T> {
 public:
  void operator()(const platform::CPUDeviceContext& context,
                  const framework::Tensor& input, framework::Tensor* output,
                  int groups) {
    ...
  }
};
```

The CUDA implementation is in the .cu file:
```
template <typename T>
class MaxOutFunctor<platform::CUDADeviceContext, T> {
 public:
  void operator()(const platform::CUDADeviceContext& context,
                  const framework::Tensor& input, framework::Tensor* output,
                  int groups) {
    ...
  }
};
```
We get the computing handle from a concrete DeviceContext and perform computations on tensors.
The implementation of an `OpKernel` is similar to the math functors; the extra thing we need to do is to register the OpKernel in a global map.

Fluid provides different registration interfaces in `op_registry.h`.
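
Conceptually, the global map can be thought of as a dictionary keyed by operator type and place. The sketch below is a Python illustration of that idea only, not the actual C++ registry in `op_registry.h`:

```python
# (op_type, place) -> kernel function
KERNEL_REGISTRY = {}

def register_kernel(op_type, place, kernel):
    """What REGISTER_OP_*_KERNEL conceptually does: fill in the global map."""
    KERNEL_REGISTRY[(op_type, place)] = kernel

def run_op(op_type, place, *args):
    """What the framework conceptually does at run time: look up and invoke."""
    return KERNEL_REGISTRY[(op_type, place)](*args)

register_kernel("crop", "CPUPlace", lambda x: x[1:-1])   # stand-in kernels
register_kernel("crop", "CUDAPlace", lambda x: x[1:-1])

print(run_op("crop", "CPUPlace", [0, 1, 2, 3, 4]))       # [1, 2, 3]
```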
Let's take the [Crop](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134) operator as an example.

In the .cc file:
```
REGISTER_OP_CPU_KERNEL(crop, ops::CropKernel<float>);
REGISTER_OP_CPU_KERNEL(
    crop_grad, ops::CropGradKernel<paddle::platform::CPUDeviceContext, float>);
```
In the .cu file:
```
REGISTER_OP_CUDA_KERNEL(crop, ops::CropKernel<float>);
REGISTER_OP_CUDA_KERNEL(
    crop_grad, ops::CropGradKernel<paddle::platform::CUDADeviceContext, float>);
```
## Advanced topics: How to switch between different Device/Library

Generally, we will implement an OpKernel for every Device/Library of an Operator, so we can easily train a convolutional neural network on a GPU. However, some OpKernels are not suitable for a specific device. For example, the crf operator can only run on the CPU, whereas most other operators can run on the GPU. To achieve high performance in such circumstances, we have to switch between different Devices/Libraries.
We will discuss how to implement an efficient OpKernel switch policy.

- TBD