
This tutorial introduces techniques we use to profile and tune the
CPU performance of PaddlePaddle.  We will use Python packages
`cProfile` and `yep`, and Google's `perftools`.

Profiling is the process that reveals performance bottlenecks,
which can be very different from what developers expect.
Performance tuning is done to fix these bottlenecks. Performance optimization
alternates between profiling and tuning.

PaddlePaddle users program AI applications by calling the Python API, which calls
into `libpaddle.so`, written in C++.  In this tutorial, we focus on
the profiling and tuning of

1. the Python code and
1. the mixture of Python and C++ code.

## Profiling the Python Code

### Generate the Performance Profiling File

We can use the Python standard
package, [`cProfile`](https://docs.python.org/2/library/profile.html),
to generate a Python profiling file.  For example:

```bash
python -m cProfile -o profile.out main.py
```

where `main.py` is the program we are going to profile, and `-o` specifies
the output file.  Without `-o`, `cProfile` prints its results to standard
output.
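
The same kind of profiling file can also be produced from inside a script, which is convenient when only part of a program is of interest. The following is a minimal sketch using the standard `cProfile.Profile` class; the function `workload` and the temporary output path are hypothetical stand-ins for the real training code and `profile.out`.

```python
import cProfile
import os
import tempfile


def workload(n):
    """Hypothetical stand-in for the code we want to profile."""
    return sum(i * i for i in range(n))


def profile_to_file(path, n=10000):
    """Profile workload(n) and write the stats to `path`, as `-o` does."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = workload(n)
    profiler.disable()
    profiler.dump_stats(path)
    return result


# Write the profile to a temporary directory so the sketch leaves no clutter.
out_path = os.path.join(tempfile.mkdtemp(), "profile.out")
result = profile_to_file(out_path)
print("wrote", out_path)
```

The resulting file has the same format as the one written by `python -m cProfile -o`, so it can be inspected with the same tools.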

### Look into the Profiling File

`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev) to look into
the details:

```bash
cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
```

where `-a` specifies the IP address of the HTTP service, `-p` specifies
the port, `-f` specifies the profiling file, and `main.py` is the source file.

Open a Web browser and point it to the local IP and the specified
port; we will see output like the following:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.284    0.284   29.514   29.514 main.py:1(<module>)
     4696    0.128    0.000   15.748    0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
     4696   12.040    0.003   12.040    0.003 {built-in method run}
        1    0.144    0.144    6.534    6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
```

where each line corresponds to a Python function, and the meaning of
each column is as follows:

| column | meaning |
| --- | --- |
| ncalls | the number of calls into a function |
| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
| percall | tottime divided by ncalls |
| cumtime | the total execution time of the function, including the execution time of other functions being called |
| percall | cumtime divided by ncalls |
| filename:lineno(function) | where the function is defined |

### Identify Performance Bottlenecks

Usually, `tottime` and the related `percall` time are what we want to
focus on. We can sort the above profiling results by `tottime`:

```text
     4696   12.040    0.003   12.040    0.003 {built-in method run}
   300005    0.874    0.000    1.681    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
   107991    0.676    0.000    1.519    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
     4697    0.626    0.000    2.291    0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
        1    0.618    0.618    0.618    0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
```
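
An ordering like the one above can also be obtained without a browser, using the standard `pstats` module. This is a minimal sketch for Python 3: it profiles a hypothetical `busy` function instead of `main.py` so that it runs on its own. For a real run, the path of `profile.out` can be passed to `pstats.Stats` instead of the profiler object.

```python
import cProfile
import io
import pstats


def busy(n):
    """Hypothetical workload standing in for main.py."""
    return sorted(str(i) for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy(20000)
profiler.disable()

# Sort by tottime (time spent in the function itself) and keep the top 5 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```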

In the sorted listing above, the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`.  We will
explain how to profile C++ code in the next section.  For now,
let's look into the fourth entry, `sync_with_cpp`, which is a
Python function.  We can click it to learn more about it:

```
Called By:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>

Function                                                                                                 was called by...
                                                                                                             ncalls  tottime  cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)  <-    4697    0.626    2.291  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)  <-    4696    0.019    2.316  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
                                                                                                                  1    0.000    0.001  /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)


Called:

   Ordered by: internal time
   List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
```

The list of callers of `sync_with_cpp` can help us understand where the
calls come from and how to improve the function.
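
A caller listing of this kind can also be printed with `pstats.Stats.print_callers`. The sketch below is self-contained and uses two hypothetical functions, `sync` and `clone`, in place of the real `sync_with_cpp` and its caller; the restriction string passed to `print_callers` is a regular expression matched against the function entries.

```python
import cProfile
import io
import pstats


def sync(values):
    """Hypothetical callee, playing the role of sync_with_cpp."""
    return sum(values)


def clone(values):
    """Hypothetical caller that invokes sync once per call."""
    return sync(list(values))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    clone(range(10))
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Restrict the "was called by" listing to entries matching "sync".
stats.sort_stats("tottime").print_callers("sync")
callers_report = buffer.getvalue()
print(callers_report)
```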

## Profiling Python and C++ Code

### Generate the Profiling File

To profile a mixture of Python and C++ code, we can use a Python
package, `yep`, that can work with Google's `perftools`, which is a
commonly-used profiler for C/C++ code.

In Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:

```bash
apt update
apt install libgoogle-perftools-dev
pip install yep
```

Then we can run the following command

```bash
python -m yep -v main.py
```

to generate the profiling file.  The default filename is
`main.py.prof`.

Please note the `-v` command line option, which prints the
analysis results after generating the profiling file.  By examining
the printed results, we can tell whether debug information was
stripped from `libpaddle.so` at build time.  The following hints
help make sure that the analysis results are readable:

1. Use the GCC command line option `-g` when building `libpaddle.so` so as to
   include the debug information.  The standard build system of
   PaddlePaddle is CMake, so you might want to set
   `CMAKE_BUILD_TYPE=RelWithDebInfo`.

1. Use the GCC command line option `-O2` or `-O3` to generate optimized
   binary code. It doesn't make sense to profile `libpaddle.so`
   without optimization, because it would run slowly anyway.

1. Profile the single-threaded binary before the
   multi-threaded version, because the latter often produces tangled
   profiling results.  You might want to set the environment
   variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically
   starting multiple threads.
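
The single-threaded run from the last hint can be scripted. The sketch below only builds the command and the environment; the helper names `single_thread_env` and `yep_command` are hypothetical, and the actual launch is left commented out because it requires `yep` and a `main.py` to be present.

```python
import os
import sys


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def yep_command(script):
    """Build the `python -m yep -v <script>` command used in this tutorial."""
    return [sys.executable, "-m", "yep", "-v", script]


env = single_thread_env()
cmd = yep_command("main.py")
print("command:", " ".join(cmd))
print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])

# To actually start the profiled run (requires yep and main.py):
# import subprocess
# subprocess.check_call(cmd, env=env)
```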

### Examining the Profiling File

The tool we use to examine the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.

We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:

```bash
go install github.com/google/pprof@latest
```

Then we can use it to examine the `main.py.prof` file generated in the previous
section:

```bash
pprof -http=0.0.0.0:3213 `which python`  ./main.py.prof
```

where `-http` specifies the IP and port of the HTTP service.
Pointing a Web browser to the service, we see something like
the following:

![result](./pprof_1.png)

### Identifying the Performance Bottlenecks

As with `cprofilev`, we focus on the counterparts of `tottime` and
`cumtime`, which `pprof` reports as `flat` and `cum`.

![kernel_perf](./pprof_2.png)

We can see that multiplication and the computation
of its gradient each take 2% to 4% of the total running
time, while `MomentumOp` takes about 17%. Clearly, `MomentumOp` is the
first candidate for optimization.
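
To estimate how much such an optimization could pay off, Amdahl's law relates the speedup of one part to the speedup of the whole program. The sketch below applies it to the 17% share reported above; the 2x local speedup is a purely hypothetical figure, not a measurement, and the function name `overall_speedup` is ours.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the running
    time is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


momentum_share = 0.17    # share of MomentumOp reported by pprof
hypothetical_gain = 2.0  # assumed local speedup, not measured

speedup = overall_speedup(momentum_share, hypothetical_gain)
# Upper bound: the 17% share removed entirely.
upper_bound = 1.0 / (1.0 - momentum_share)

print("overall speedup: %.3f" % speedup)
print("upper bound:     %.3f" % upper_bound)
```

The same formula applied to an operator with a 2% to 4% share gives a much smaller bound, which is why the largest share is usually tuned first.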

`pprof` marks the performance-critical parts of the program in
red. It's a good idea to follow these hints.