commit a3ba264c47

@@ -0,0 +1,27 @@
# Embed Paddle Inference in Your Application

Paddle inference offers APIs in the `C` and `C++` languages.

One can easily deploy a model trained by Paddle by following the steps below:

1. Optimize the native model;
2. Write some code for deployment.

Let's explain the steps in detail.

## Optimize the native Fluid Model

The native model produced by the training phase needs to be optimized for inference. The optimization includes:

- Cleaning up noise such as the cost operators that are not needed for inference;
- Pruning unnecessary computation branches that have nothing to do with the output;
- Removing extraneous variables;
- Reusing memory in the native Fluid executor;
- Translating the model storage format to a third-party engine's, so that the inference API can utilize the engine for acceleration.

We have an official tool to do the optimization; call `paddle_inference_optimize --help` for more information.
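
For example (a sketch only: `--help` is the one flag this document names, and the `--model_dir`/`--output_dir` flags below are hypothetical placeholders for whatever the tool actually accepts):

```bash
# List the tool's real options.
paddle_inference_optimize --help

# Hypothetical invocation: read a trained Fluid model and write the
# optimized, inference-ready version (flag names are placeholders).
paddle_inference_optimize --model_dir=./fluid_model --output_dir=./inference_model
```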

## Write some code

Read `paddle_inference_api.h` for more information.
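
For illustration, here is a minimal sketch of deployment code against the `Predictor` API declared in that header; the model path, variable names, and tensor shapes are hypothetical and must match your own exported model:

```c++
#include <string>
#include <vector>

#include "paddle_inference_api.h"  // the header shown below

int main() {
  // Configure the predictor. The model directory is a hypothetical path.
  paddle::Predictor::Attr attr;
  attr.model_dir = "./my_inference_model";
  attr.enable_engine = false;  // stay on the native Fluid facility
  attr.engine_kind = paddle::Predictor::Attr::EngineKind::kNone;

  paddle::Predictor predictor;
  if (!predictor.Init(attr)) return 1;

  // Hypothetical single input "image" of shape [1, 3, 224, 224] and
  // single output "prob" of shape [1, 1000].
  std::vector<std::string> inputs = {"image"};
  std::vector<std::string> outputs = {"prob"};
  std::vector<std::vector<int>> input_shapes = {{1, 3, 224, 224}};
  std::vector<std::vector<int>> output_shapes = {{1, 1000}};
  std::vector<std::vector<float>> input_data = {
      std::vector<float>(1 * 3 * 224 * 224, 0.f)};
  std::vector<std::vector<float>> output_data;

  // Run one record; output_data is filled on success.
  if (!predictor.Run(inputs, outputs, input_shapes, output_shapes, input_data,
                     &output_data)) {
    return 1;
  }
  return 0;
}
```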

@@ -0,0 +1,69 @@
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once

#include <string>
#include <vector>

namespace paddle {

class Predictor {
 public:
  struct Attr;
  Predictor() = default;

  // Build the network before inference.
  bool Init(const Attr& attr);

  // Run inference on one record.
  // Arguments:
  //   inputs: the names of the input variables.
  //   outputs: the names of the output variables.
  //   input_shapes: the shapes of the input variables.
  //   output_shapes: the shapes of the output variables.
  //   input_data: the data of the input variables.
  //   output_data: the data of the output variables.
  bool Run(const std::vector<std::string>& inputs,
           const std::vector<std::string>& outputs,
           const std::vector<std::vector<int>>& input_shapes,
           const std::vector<std::vector<int>>& output_shapes,
           const std::vector<std::vector<float>>& input_data,
           std::vector<std::vector<float>>* output_data);

  // Clone a predictor that shares the model weights.
  Predictor* Clone();

  // Destroy the Predictor.
  ~Predictor();

  struct Attr {
    enum class EngineKind {
      kNone = -1,          // Use the native Fluid facility.
      kAnakin,             // Use Anakin for inference.
      kTensorRT,           // Use TensorRT for inference.
      kAutoMixedAnakin,    // Automatically mix Fluid with Anakin.
      kAutoMixedTensorRT,  // Automatically mix Fluid with TensorRT.
    };

    std::string model_dir;      // Path to the model directory.
    bool enable_engine{false};  // Enable to execute (part of) the model on
                                // third-party engines.
    EngineKind engine_kind{EngineKind::kNone};
  };
};

}  // namespace paddle

@@ -0,0 +1,110 @@

# Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 to run such training jobs to
achieve the best performance.

## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each installed with 8 GPUs and one 100Gb RDMA card.
The base environment is:

* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30, 192.168.16.34

In general, the steps include:

1. Install GPU drivers.
1. Install RDMA drivers.
1. Install "InfiniBand Support".
1. Use docker to run tests and make sure GPUs and RDMA can work inside
   the container.

I'll omit the section "Install GPU drivers" because instructions can easily
be found elsewhere.

### Install RDMA Drivers

In my case, I have two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
CentOS 7.4, and I updated the kernel to version 4.4 so that docker could
work with the latest overlay2 filesystem.

***NOTE: before you start, make sure you have a way to get a console
on the server other than ssh, because we may need to re-configure the
network device.***

1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
   download the `MLNX_OFED` software at the bottom of the page, and upload it
   onto the server.
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work. Note that
   this operation may cause the network to go down if you are using this
   RDMA device as the default network device and ssh to log in to the server.
1. Re-configure the network interface, for example:
   `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
   `ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test that the two nodes have basic ICMP connectivity.
1. Use either `udaddy` or `ib_write_bw` to test that the RDMA connection is
   ready and has the desired bandwidth, as sketched below.
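
For example, a minimal bandwidth check with `ib_write_bw` (from the perftest package) might look like this; start the server side first, then point the client at it:

```bash
# On node 1 (192.168.16.30): start ib_write_bw in server mode and wait.
ib_write_bw

# On node 2 (192.168.16.34): connect to node 1 and report the measured bandwidth.
ib_write_bw 192.168.16.30
```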

### Prepare Docker Image to Run RDMA Programs

1. Build a docker image from a CUDA base image such as `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the paddlepaddle whl
   package in it.
1. Start a docker container and mount GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the docker image (see the section below),
   and also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`,
   or just use privileged mode `--privileged`.
1. Start the container using host network mode: `--net=host`. A complete
   example command is sketched after this list.
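
Putting these steps together, a `docker run` command might look like the sketch below; the image name is a hypothetical placeholder, and the exact set of `-v` mounts must match the library list in the next section:

```bash
# Privileged mode exposes the GPU and RDMA devices to the container;
# host networking lets NCCL2 peers reach each other directly.
docker run --rm --privileged --net=host \
    -v /usr/lib64/mlnx_ofed/valgrind:/usr/lib64/mlnx_ofed/valgrind \
    my_paddle_rdma_image \
    python vgg16.py
```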

### RDMA Library Files Needed

Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. Other libs needed to run RDMA programs
are listed below. These libs must be mounted into the docker container.

* Libs under `/usr/lib64/mlnx_ofed/valgrind`
  * libibcm.so
  * libibverbs.so
  * libmlx4.so
  * libmlx5.so
  * libmlx5-rdmav2.so
  * librdmacm.so
* Other libs:
  * libnl-3.so.200
  * libnl-route-3.so.200
  * libnuma.so.1

## Start to Run the Training Job

Set the following NCCL environment variables to turn NCCL switches on and off:

| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The RDMA device, e.g. eth2 |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set debug level: VERSION, WARN, INFO |
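
For example, for the setup described above one might export the following; the values are one reasonable choice based on the table, not the only valid configuration:

```bash
# Bind NCCL's bootstrap traffic to the RDMA-backed interface configured earlier.
export NCCL_SOCKET_IFNAME=eth2
# Keep the InfiniBand/RDMA transport enabled.
export NCCL_IB_DISABLE=0
# Enable GPU Direct RDMA if the hardware supports it.
export NCCL_IB_CUDA_SUPPORT=1
# Print NCCL version info and diagnostics.
export NCCL_DEBUG=INFO
```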

My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:

```bash
PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py
```

On node 2, run:

```bash
PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py
```