# Build PaddlePaddle for Raspberry Pi
You can build the inference library of PaddlePaddle for Raspberry Pi in either of the following two ways:
1. Build using SSH: Log in to a Raspberry Pi using SSH and build the library there. The required development tools and third-party dependencies are listed here: [`/Dockerfile`](
1. Cross-compile: Build the library for Raspberry Pi on a Linux/x64 machine. The rest of this article describes this approach in detail.
## The Cross-Compiling Toolchain
Step 1. Clone the GitHub repo by running the following command.
git clone
Step 2. Use the pre-built cross-compiler found in `./tools/tree/master/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64`. To run it on a Linux computer, glibc version >= 2.14 is needed.
## CMake Arguments
CMake supports [cross-compiling]( All CMake configuration arguments required for the cross-compilation for Raspberry Pi can be found in [`cmake/cross_compiling/raspberry_pi.cmake`](
Some important arguments that need to be set:
- `CMAKE_SYSTEM_NAME`: The target platform. Must be `RPi`.
- `RPI_TOOLCHAIN`: The absolute path of the cross-compiling toolchain.
- `RPI_ARM_NEON`: Whether to use ARM NEON intrinsics. This argument is required and defaults to `ON`.
- `HOST_C/CXX_COMPILER`: The C/C++ compiler for the host. It is used to build tools that run on the host, for example, protoc.
A commonly-used CMake configuration is as follows:
-DRPI_TOOLCHAIN=your/path/to/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64 \
-DCMAKE_INSTALL_PREFIX=your/path/to/install \
To build the inference library, set `WITH_C_API=ON`.
You can add more arguments. For example, to minimize the size of the generated inference library, you may use `CMAKE_BUILD_TYPE=MinSizeRel`. For performance optimization, you may use `CMAKE_BUILD_TYPE=Release`.
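The arguments described in this section can be assembled into a single `cmake` invocation. The following is a minimal Python sketch, not part of the PaddlePaddle build scripts: the helper name `rpi_cmake_command` and the placeholder paths are hypothetical, it uses only the arguments named above, and it assumes `cmake` is run from a `build` subdirectory of the source tree (hence the trailing `..`).

```python
import shlex


def rpi_cmake_command(toolchain, install_prefix, build_type="MinSizeRel"):
    """Return the cmake argument list for cross-compiling the inference library.

    Hypothetical helper: it only assembles the arguments named in this
    section and does not run anything.
    """
    return [
        "cmake",
        "-DCMAKE_SYSTEM_NAME=RPi",            # target platform, must be RPi
        f"-DRPI_TOOLCHAIN={toolchain}",       # absolute path of the toolchain
        "-DRPI_ARM_NEON=ON",                  # use ARM NEON intrinsics
        f"-DCMAKE_INSTALL_PREFIX={install_prefix}",
        "-DWITH_C_API=ON",                    # build the inference library
        f"-DCMAKE_BUILD_TYPE={build_type}",   # MinSizeRel or Release
        "..",                                 # assumption: run from ./build
    ]


if __name__ == "__main__":
    cmd = rpi_cmake_command(
        "/your/path/to/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64",
        "/your/path/to/install",
    )
    # Print a copy-pastable command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the arguments in a list makes it easy to switch `CMAKE_BUILD_TYPE` between `MinSizeRel` and `Release` without retyping the rest of the command.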
## Build and Install
The following commands build the inference library of PaddlePaddle for Raspberry Pi and third-party dependencies.
make install
The intermediate files will be stored in `build`. Third-party libraries will be located in `build/third_party`. If you have already built it for other platforms like Android or iOS, you may want to clear these directories by running the command: `rm -rf build`.
The inference library will be in `your/path/to/install/lib`, with related header files in `your/path/to/install/include`.
# Cluster bootstrapping tool survey
## Abstract
To bring up a cluster from bare-metal machines to a fully functional Kubernetes cluster on which PaddlePaddle can run, we need some tooling. Here we compare [Sextant]( and [Tectonic installer](
## Basic assumptions
Here are some basic assumptions before we move on to the details:
1. You are an administrator of a bare-metal machine cluster, which means:
   * You have full control over each of the machines.
   * You have full control over the network that the machines are connected to.
2. The machines can be booted from the network with PXE or iPXE.
3. You understand the [general procedure to bring up a cluster](#appendix-general-procedure-to-bring-up-a-cluster).

If all of the above apply to your cluster, keep reading.
## Comparing Sextant and Tectonic installer
### Sextant
Sextant is an end-to-end solution for bringing up a bare-metal cluster as a fully functional k8s cluster. It integrates DHCP, name service, PXE, cloud-config service, and Docker registry services.
#### Pros
1. End-to-end: basically all the admin needs to do is configure `cluster.yaml` and power on the cluster.
2. Offline cluster configuration: working with Sextant has two phases, config time and deploy time. At config time, the admin's machine needs internet connectivity to download images and other dependencies. At deploy time, the cluster can be completely offline, since all dependencies were prepared at config time.
3. Docker registry is integrated.
4. GPU machines are taken care of.
#### Cons
1. The k8s API server is not deployed with high availability by default.
2. No grouping support.
3. No API; it is designed as a one-off service.
### Tectonic installer
First of all, Tectonic is not free: installation requires an account, and free users can only create fewer than 10 nodes.
Tectonic is a suite of software that wraps around k8s and provides additional DevOps utilities. Tectonic installer, as the name suggests, installs Tectonic on a bare-metal cluster, which means it is not an exact equivalent of Sextant. For the "booting a cluster" part, it mostly relies on [Matchbox](, which is a general cluster bootstrapper.
Matchbox's approach is similar to Sextant's.
#### Pros
1. Supports grouping machines.
2. Supports running the provisioning service in rkt (not a big deal, though).
3. Supports an HTTP/gRPC API.
4. Supports multiple templates.
#### Cons
1. Not an end-to-end solution for bringing up a cluster; it requires a lot of extra work and other software.
2. [Not fully supporting]( CentOS deployment yet.
## Conclusion
Overall, Sextant is the better solution for deploying paddle cloud to a bare-metal cluster. It would be great if Sextant could also 1) deploy the k8s API server with high availability by default, and 2) not be designed as a one-off service.
## Appendix: General procedure to bring up a cluster
It is impractical for a cluster admin to manually install the OS and applications on cluster nodes one by one. Here is what an admin in the cloud industry would typically do instead:
1. Set up a bootstrap machine with a static IP in the cluster, running the following services:
   * DHCP: assigns IP addresses to the rest of the nodes.
   * Name service: maps node names to IP addresses.
   * PXE-related services: boot-related information is delivered to a newly booted machine when its IP address is assigned via DHCP. The PXE service then provides further boot and installation information and images over TFTP and HTTP.
   * Cluster config service: provides cluster nodes with their OS configuration via HTTP.
   * Optional Docker registry: a built-in Docker registry makes the whole cluster independent of internet connectivity and speeds up software distribution.
2. When a new node powers on:
   * It broadcasts a request for an IP address.
   * The DHCP server assigns the IP address and delivers the PXE boot information to the node.
   * The node uses the boot information delivered with DHCP to request config files from the TFTP service. In most cases, the config file points to an HTTP service for the boot image.
   * Since PXE is configured with an initrd, the node uses the cloud config service to perform further installations, such as CoreOS or K8s.
   * The node then restarts.
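The sequence above can be illustrated with a toy, in-memory model. This is only a sketch of how information flows between a new node and the bootstrap machine, not of any real DHCP, PXE, or TFTP implementation: the class `BootstrapMachine`, its methods, the address range, the node names, the config file path, and the URLs are all hypothetical.

```python
import ipaddress


class BootstrapMachine:
    """Toy stand-in for the bootstrap machine and the services it hosts."""

    def __init__(self, subnet, http_base):
        # DHCP pool: host addresses of the subnet, handed out in order.
        self._pool = iter(ipaddress.ip_network(subnet).hosts())
        self.ip = str(next(self._pool))  # the bootstrap machine's static IP
        self._names = {}                 # name service: node name -> IP
        self._http_base = http_base

    def dhcp_offer(self, mac):
        """Assign an IP and deliver boot info alongside it."""
        ip = str(next(self._pool))
        name = f"node-{len(self._names)}"
        self._names[name] = ip
        return {"ip": ip, "name": name, "tftp_config": f"boot/{mac}.cfg"}

    def tftp_fetch(self, tftp_config):
        """Return a boot config that points at the HTTP service."""
        return {
            "image_url": f"{self._http_base}/images/boot.img",
            "cloud_config_url": f"{self._http_base}/cloud-config",
        }

    def resolve(self, name):
        """Name service lookup."""
        return self._names[name]


def boot_node(bootstrap, mac):
    """Walk one node through the steps listed above and return what it learned."""
    lease = bootstrap.dhcp_offer(mac)                       # broadcast + DHCP reply
    boot_cfg = bootstrap.tftp_fetch(lease["tftp_config"])   # config via TFTP
    # A real node would now download the image over HTTP, apply the cloud
    # config, install the OS and k8s, and restart.
    return {**lease, **boot_cfg}


if __name__ == "__main__":
    bootstrap = BootstrapMachine("10.0.0.0/29", "http://bootstrap.local")
    for mac in ("aa-bb-cc-00-00-01", "aa-bb-cc-00-00-02"):
        print(boot_node(bootstrap, mac))
```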
For further reading, the following two links from Matchbox are recommended:
* [Machine lifecycle](
* [PXE booting](
# Operator fusion
Fusing multiple operators together is an important way to optimize program execution, particularly on GPUs and other specialized accelerators. An obvious benefit is that it avoids the overhead of saving intermediate results back into global memory.
There are generally two ways to fuse operators: fusing directly connected operators and fusing operators that are not directly connected. The first method is mainly used by [NNVM Compiler]( and [XLA]( The second method is mainly used by DyNet and TensorFlow Fold for auto-batching. In both cases, fusion combines multiple operations into one according to a set of rules. For example, `Y = X * W` and `Z = Y + B` can be fused into `Z = X * W + B`, and `Y1 = X1 * W` and `Y2 = X2 * W` can be fused into `[Y1;Y2] = [X1;X2] * W`. To get a short-term benefit, we decided to specify these rules manually.
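Both example rules can be checked numerically. The sketch below uses plain Python lists as matrices so that it is self-contained; the helper names (`matmul`, `add_bias`, `fused_matmul_bias`) are illustrative and are not PaddlePaddle operators. It confirms that each fused form produces the same result as the operators it replaces.

```python
def matmul(x, w):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def add_bias(y, b):
    """Add the bias row vector b to every row of y."""
    return [[v + bj for v, bj in zip(row, b)] for row in y]


def fused_matmul_bias(x, w, b):
    """Z = X * W + B in one pass, without materializing Y = X * W."""
    return [
        [sum(a * wv for a, wv in zip(row, col)) + bj for col, bj in zip(zip(*w), b)]
        for row in x
    ]


W = [[1, 2], [3, 4]]
B = [10, 20]
X1 = [[1, 0], [0, 1]]
X2 = [[2, 2]]

# Rule 1: fuse directly connected operators.
unfused = add_bias(matmul(X1, W), B)
assert fused_matmul_bias(X1, W, B) == unfused

# Rule 2: fuse operators that are not directly connected but share W,
# by stacking their inputs row-wise: [Y1; Y2] = [X1; X2] * W.
stacked = matmul(X1 + X2, W)
assert stacked == matmul(X1, W) + matmul(X2, W)
```

In the first rule the saving comes from never storing `Y`; in the second it comes from replacing two small products with one larger product.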
## Challenge
The challenges of fusing operators are:
- how to define the rules.
- how to implement these rules efficiently.
### How to make the rules?
Determining the best location for each fused operator is an NP-hard combinatorial problem. After analyzing the operators of DL models, we found two groups of operators that can be fused explicitly. One is simple, adjacent operations, for example `tmp = x + y` and `z = Relu(tmp)`. The other is operators that have the same function, for example a series of `SGD` or `Momentum` operators. Both usually appear in large numbers in a model, so we should first consider how to fuse each group separately.
### How to implement these rules efficiently?
#### How to fuse the adjacent operations efficiently?
Here we use a template function to represent the fused operations. A template function is simple and efficient, but it is not easy to extend and can only express some simple operations. Given our current needs, the template function is the more appropriate choice.
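The idea can be sketched as follows. The design described here is a template function; the Python sketch below only mimics that with a higher-order function that takes the binary and unary functors as parameters, and all names in it (`fused_elemwise`, `add`, `relu`) are illustrative rather than PaddlePaddle APIs. The point is that `tmp = x + y` is never stored as a full intermediate array: each element is combined and activated in a single pass.

```python
def fused_elemwise(unary, binary, xs, ys):
    """Compute unary(binary(x, y)) element by element in a single pass.

    Stands in for a template function parameterized by two functors;
    no intermediate array for binary(x, y) is ever materialized.
    """
    return [unary(binary(x, y)) for x, y in zip(xs, ys)]


def add(x, y):
    return x + y


def relu(v):
    return v if v > 0 else 0


x = [1.0, -2.0, 3.0]
y = [0.5, 1.0, -4.0]

# Unfused: two passes and one temporary list.
tmp = [add(a, b) for a, b in zip(x, y)]
z_unfused = [relu(t) for t in tmp]

# Fused: one pass, no temporary.
z_fused = fused_elemwise(relu, add, x, y)
assert z_fused == z_unfused
```

Swapping in another pair of functors covers other simple adjacent patterns, but anything beyond "one binary operation followed by one unary operation" does not fit this shape, which reflects the limited extensibility mentioned above.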
#### How to fuse the operators that have the same function efficiently?
Take the SGD operator as an example. A training model may have hundreds of parameters and, correspondingly, the same number of SGD operators. The expression (`w = w - lr*w_g`) is the same for all of these operators, so during training the executor evaluates it hundreds of times on the CPU or on a specialized accelerator. If we fuse these operators and make the addresses of all `w` and of all `w_g` contiguous, respectively, the expression only needs to be executed once. For some accelerators, the time to launch a kernel is not negligible, so launching and executing hundreds of kernels may take longer than launching and executing one. A DL model usually contains many operators that can be handled like `SGD`, such as `AllReduce` and `FC`.
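A minimal sketch of this idea, assuming the parameters and gradients can be packed into two contiguous buffers. The function names and the returned launch count are hypothetical; the count merely stands in for the number of kernel launches an accelerator would see.

```python
from array import array


def sgd_unfused(params, grads, lr):
    """One update 'launch' per parameter: w = w - lr * w_g."""
    launches = 0
    for w, g in zip(params, grads):
        for i in range(len(w)):
            w[i] -= lr * g[i]
        launches += 1
    return launches


def sgd_fused(flat_w, flat_g, lr):
    """A single update over contiguous buffers holding all w and all w_g."""
    for i in range(len(flat_w)):
        flat_w[i] -= lr * flat_g[i]
    return 1


def pack(tensors):
    """Copy a list of 1-D tensors into one contiguous buffer."""
    flat = array("d")
    for t in tensors:
        flat.extend(t)
    return flat


params = [array("d", [1.0, 2.0]), array("d", [3.0]), array("d", [4.0, 5.0, 6.0])]
grads = [array("d", [0.5, 0.5]), array("d", [1.0]), array("d", [2.0, 2.0, 2.0])]

flat_w, flat_g = pack(params), pack(grads)  # pack before updating in place

n_unfused = sgd_unfused(params, grads, lr=0.5)
n_fused = sgd_fused(flat_w, flat_g, lr=0.5)

# Both paths produce the same updated values.
assert list(flat_w) == [v for w in params for v in w]
print(n_unfused, "launches unfused vs", n_fused, "fused")
```

The sketch copies the tensors into a flat buffer for clarity. To actually benefit from fusion, the parameters and gradients would need to be allocated contiguously from the start, so that no packing step is required.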
