# PaddlePaddle Fluid: Towards a Compiled Programming Language

As described in [fluid.md](fluid.md), when a Fluid application program
runs, it generates a `ProgramDesc` protobuf message as an intermediate
representation of itself. The C++ class `Executor` can run this
protobuf message as an interpreter. This article describes the Fluid
compiler.

## ProgramDesc

Before we go deeper into the idea of compiled language, let us take a
look at a simple example Fluid application.

```python
import "fluid"

func paddlepaddle() {
  X = fluid.read(...)
  W = fluid.Tensor(...)
  Y = fluid.mult(X, W)
}
```

This program consists of a [block](block.md) of three operators --
`read`, `assign`, and `mult`. Its `ProgramDesc` message looks like
the following:

```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, W, Y],
    ops = [
      read(output = X)
      assign(input = ..., output = W)
      mult(input = {X, W}, output = Y)
    ],
  }
}
```

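Since the `Executor` mentioned above interprets exactly this kind of message, a minimal sketch of that path could look like the following. The constructor and `Run` signatures here are assumptions for illustration; the actual classes live under `paddle/framework` and may differ in detail.

```c++
#include "paddle/framework/executor.h"
#include "paddle/framework/program_desc.h"
#include "paddle/framework/scope.h"
#include "paddle/platform/place.h"

// Run block 0 of the program with the interpreter-style Executor.
void InterpretProgram(const paddle::framework::ProgramDesc& program) {
  paddle::framework::Scope scope;                // will hold X, W, and Y
  paddle::platform::CPUPlace place;              // or a CUDA place for GPU
  paddle::framework::Executor executor(place);
  executor.Run(program, &scope, /*block_id=*/0);
}
```
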
## Transpilers

We can write a transpiler program that takes a `ProgramDesc`, e.g.,
the above one, and outputs another `ProgramDesc`. Let us take some
examples, followed by a sketch of what such a pass could look like in
code:

1. *Memory optimization transpiler*: We can write a transpiler that
   inserts some `FreeMemoryOp`s into the above example `ProgramDesc` so
   as to free memory early, before the end of an iteration, and thus
   keep a small memory footprint.

1. *Distributed training transpiler*: We can write a transpiler that
   converts a `ProgramDesc` into its distributed version: two
   `ProgramDesc`s -- one to be run by the trainer processes and the
   other by the parameter server.

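As promised above, here is a sketch of what such a pass could look like in code: a transpiler is simply a function from one `ProgramDesc` message to another, so passes can be chained. The header path, the `proto` namespace alias, and the `FreeMemoryTranspile` name are assumptions for illustration only.

```c++
#include "paddle/framework/framework.pb.h"

namespace proto = paddle::framework::proto;

// A hypothetical memory-optimization pass: it copies the input program
// and would append FreeMemoryOp entries after the last use of each
// variable; the lifetime analysis itself is elided.
proto::ProgramDesc FreeMemoryTranspile(const proto::ProgramDesc& input) {
  proto::ProgramDesc output = input;  // start from a copy of the input program
  // ... walk output.blocks(0).ops(), find the last use of every variable,
  //     and insert a FreeMemoryOp right after that use ...
  return output;
}
```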

In the rest of this article, we talk about a special kind of
transpiler, the *native code generator*, which takes a `ProgramDesc`
and generates a `.cu` (or `.cc`) file that can be built by a C++
compiler (gcc, nvcc, icc) into a binary.

## Native Code Generator

For the above example, the native code generator transpiler, say, the
CUDA code generator, should generate a `main` function:

```c++
int main() {
  auto X = fluid_cuda_read(...);
  auto W = fluid_cuda_create_tensor(...);
  auto Y = fluid_cuda_mult(X, W);
}
```

and the definitions of the functions `fluid_cuda_read`,
`fluid_cuda_create_tensor`, and `fluid_cuda_mult`. Please be aware
that each function could just define a C++ instance of an operator and
run it. For example:

```c++
paddle::Tensor fluid_cuda_read(...) {
  paddle::Tensor t;
  paddle::operators::Read r(&t, ...);
  r.Run();
  return t;
}
```

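The document does not spell out `fluid_cuda_create_tensor`; one plausible definition, assumed here purely for illustration, follows the same pattern and wraps the `assign` operator that produces `W` in the `ProgramDesc` above:

```c++
// Hypothetical: create the output tensor W and run the assign operator
// that initializes it, mirroring fluid_cuda_read above.  The
// paddle::operators::Assign name and its constructor arguments are
// assumptions.
paddle::Tensor fluid_cuda_create_tensor(...) {
  paddle::Tensor t;
  paddle::operators::Assign a(&t, ...);
  a.Run();
  return t;
}
```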

For computational operators that have multiple *kernels*, each for a
specific hardware platform, for example, the `mult` operator, the
generated code should call its CUDA kernel:

```c++
paddle::Tensor fluid_cuda_mult(const paddle::Tensor& a,
                               const paddle::Tensor& b) {
  paddle::Tensor t;
  paddle::operators::Mult m(a, b, ...);
  m.Run(cuda_context);
  return t;
}
```

where `cuda_context` could be a global variable of type
`paddle::CUDADeviceContext`.

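For completeness, the generated file would also have to define that global. A minimal sketch, using the type name as written in this document (the real class lives in the `paddle::platform` namespace and its constructor arguments may differ):

```c++
// Hypothetical definition of the global device context used by
// fluid_cuda_mult; one context per generated binary, bound to GPU 0.
paddle::CUDADeviceContext cuda_context(/* device id = */ 0);
```
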
## Multi-Block Code Generation

Most Fluid application programs may have more than one block. To
execute them, we need to trace [scopes](scope.md).

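As a rough illustration of what tracing scopes could mean for generated code, the sketch below keeps one scope per block and lets lookups fall back to the enclosing block's scope; the `GeneratedScope` name and layout are assumptions, not the design in scope.md.

```c++
#include <map>
#include <string>

// Hypothetical per-block scope for generated multi-block code: a
// variable lookup first checks the current block's scope and then
// walks outward through the enclosing blocks' scopes.
struct GeneratedScope {
  GeneratedScope* parent = nullptr;
  std::map<std::string, paddle::Tensor> vars;

  paddle::Tensor* Find(const std::string& name) {
    auto it = vars.find(name);
    if (it != vars.end()) return &it->second;
    return parent != nullptr ? parent->Find(name) : nullptr;
  }
};
```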