@@ -0,0 +1,62 @@
set -e

function clock_to_seconds() {
  hours=`echo $1 | awk -F ':' '{print $1}'`
  mins=`echo $1 | awk -F ':' '{print $2}'`
  secs=`echo $1 | awk -F ':' '{print $3}'`
  echo `awk 'BEGIN{printf "%.2f",('$secs' + '$mins' * 60 + '$hours' * 3600)}'`
}

function infer() {
  unset OMP_NUM_THREADS MKL_NUM_THREADS OMP_DYNAMIC KMP_AFFINITY
  topology=$1
  layer_num=$2
  bs=$3
  thread=`nproc`
  if [ $thread -gt $bs ]; then
    thread=$bs
  fi
  log="logs/infer-${topology}-${layer_num}-${thread}openblas-${bs}.log"

  models_in="models/${topology}-${layer_num}/pass-00000/"
  if [ ! -d $models_in ]; then
    echo "Please run ./run_mkl_infer.sh first to save the model."
    exit 0
  fi
  log_period=$((256 / bs))
  paddle train --job=test \
    --config="${topology}.py" \
    --use_gpu=False \
    --trainer_count=$thread \
    --log_period=$log_period \
    --config_args="batch_size=${bs},layer_num=${layer_num},is_infer=True" \
    --init_model_path=$models_in \
    2>&1 | tee ${log}

  # Calculate the elapsed time of the last 5 log periods (1280 samples);
  # the periods before that are treated as warm-up time.
  start=`tail ${log} -n 7 | head -n 1 | awk -F ' ' '{print $2}' | xargs`
  end=`tail ${log} -n 2 | head -n 1 | awk -F ' ' '{print $2}' | xargs`
  start_sec=`clock_to_seconds $start`
  end_sec=`clock_to_seconds $end`
  fps=`awk 'BEGIN{printf "%.2f",(1280 / ('$end_sec' - '$start_sec'))}'`
  echo "Last 1280 samples start: ${start}(${start_sec} sec), end: ${end}(${end_sec} sec)" >> ${log}
  echo "FPS: $fps images/sec" 2>&1 | tee -a ${log}
}

if [ ! -f "train.list" ]; then
  echo " " > train.list
fi
if [ ! -f "test.list" ]; then
  echo " " > test.list
fi
if [ ! -d "logs" ]; then
  mkdir logs
fi

# inference benchmark
for batchsize in 1 2 4 8 16; do
  infer googlenet v1 $batchsize
  infer resnet 50 $batchsize
  infer vgg 19 $batchsize
done

@@ -0,0 +1,39 @@
set -e

function train() {
  unset OMP_NUM_THREADS MKL_NUM_THREADS OMP_DYNAMIC KMP_AFFINITY
  topology=$1
  layer_num=$2
  bs=$3
  thread=`nproc`
  # each trainer uses only 1 core to avoid conflicts
  log="logs/train-${topology}-${layer_num}-${thread}openblas-${bs}.log"
  args="batch_size=${bs},layer_num=${layer_num}"
  config="${topology}.py"
  paddle train --job=time \
    --config=$config \
    --use_gpu=False \
    --trainer_count=$thread \
    --log_period=10 \
    --test_period=100 \
    --config_args=$args \
    2>&1 | tee ${log}

  # compute FPS from the average time per batch (ms) reported in the last log line
  avg_time=`tail ${log} -n 1 | awk -F ' ' '{print $8}' | sed 's/avg=//'`
  fps=`awk 'BEGIN{printf "%.2f",('$bs' / '$avg_time' * 1000)}'`
  echo "FPS: $fps images/sec" 2>&1 | tee -a ${log}
}

if [ ! -f "train.list" ]; then
  echo " " > train.list
fi
if [ ! -d "logs" ]; then
  mkdir logs
fi

# training benchmark
for batchsize in 64 128 256; do
  train vgg 19 $batchsize
  train resnet 50 $batchsize
  train googlenet v1 $batchsize
done

@@ -1,23 +1,29 @@
# Executor Design Doc

## Motivation
In [fluid](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/fluid.md), we encourage the user to use deep learning programming paradigms to describe the training process. When the user-written Python program is executed, it will first create a protobuf message
[`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/paddle/framework/framework.proto#L145) that describes the process and is conceptually like an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).

The executor runs the `ProgramDesc` like an interpreter. The `ProgramDesc` contains the intrinsics (operators in this case) and the variables that will be used; the executor explicitly executes this stored, precompiled code.

## Overview

An executor takes a `ProgramDesc`, a `block_id` and a `Scope`. The `ProgramDesc` is a list of blocks and each block contains the protobuf definition of all the parameters and operators in the block. The `block_id` specifies the entrance block. And the `Scope` is the container of all the variable instances, which is persistent throughout different runs.

## Executor

The `Executor` explicitly executes all the intrinsics (operators here) in the `block_id`th block of a `ProgramDesc`. Essentially, it instantiates Variables and Operators, then runs all the operators in sequence one-by-one.
It is very similar to how a stack frame is pushed when entering a block; correspondingly, it cleans up all the temporary variables when a mini-batch is finished. It does not, however, have the stack-frame pop process.

### The interface
```c++
Executor(places);
```
An executor does not own any computing resources; a user can only construct an executor with the specified places.

### Running an Executor

```
void Run(ProgramDesc, Scope, block_id, create_local_scope);
```
An `Executor` only provides a unified way to execute a `ProgramDesc`. The `ProgramDesc` is the target to be executed, the `Scope` specifies the variable container, the `block_id` indicates the entrance block, and `create_local_scope` is a boolean that states whether the temporary variables are destroyed after the execution finishes.
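
For illustration, a hypothetical usage sketch based only on the two interfaces above; the header path, namespaces and exact argument types are assumptions rather than the framework's verbatim API:

```c++
// Hypothetical usage sketch of the interfaces above; the header path,
// namespaces and exact argument types are assumptions, not the verbatim API.
#include <vector>
#include "paddle/framework/executor.h"

void RunMainBlock(const ProgramDesc& program, Scope* scope) {
  std::vector<platform::Place> places = {platform::CPUPlace()};

  // The executor does not own any computing resources; it is only told
  // which places to run on.
  Executor executor(places);

  // Execute block 0 (the entrance block) of the program. With
  // create_local_scope=true, temporary variables live in a local scope that
  // is dropped once this run (one mini-batch) finishes, while persistent
  // variables stay in `scope` across runs.
  executor.Run(program, scope, /*block_id=*/0, /*create_local_scope=*/true);
}
```
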
@@ -0,0 +1,65 @@
# Design Doc: NCCL support in Paddle Fluid

## Abstract

This design doc describes the NCCL feature in Paddle. We propose an approach to support the NCCL library on both a single machine and multiple machines. We wrap the NCCL primitives `Broadcast`, `Allreduce`, and `Reduce` as operators so that multi-GPU power can be utilized from a single script.


## Motivation

[NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library for multi-GPU communication, optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that achieve high bandwidth over PCIe and the NVLink high-speed interconnect. With the NCCL library, we can easily accelerate training in parallel.

- Pros
  1. Easy to plug in the [NCCL2](https://developer.nvidia.com/nccl) library.
  1. High performance on NVIDIA GPUs.
  1. MPI-like primitives with a low learning cost for users.

- Cons
  1. Designed only for NVIDIA GPUs, so it is not a general multi-device solution.
  1. Although NCCL1 is open-sourced under the BSD license, NCCL2 is no longer open source.

At the beginning of training, the framework needs to distribute the same parameters to every GPU, and during training it needs to merge the gradients whenever the user requires.

As a result, during training we need peer-to-peer copies between different GPUs, aggregation of gradients/parameters from the GPUs, and broadcasting of parameters to the GPUs. Every GPU only needs to run the operator with the correct place information.

Besides, interfaces are needed to synchronize model updates across the different GPU cards.

## Implementation

As mentioned above, we wrap the NCCL routines as several kinds of operators. Note that NCCL needs to create a communicator between the GPUs at the beginning, so an `NCCLInit` operator is created.
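
For illustration only, here is a minimal sketch of the raw NCCL calls that such operators would wrap; it uses the standard NCCL/CUDA C APIs directly and is not Paddle's actual operator code:

```cpp
// Illustration only: the raw NCCL primitives that NCCLInit / Broadcast /
// AllReduce operators would wrap. Not Paddle operator code.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  // "NCCLInit": create one communicator per GPU.
  std::vector<int> devs(ndev);
  for (int i = 0; i < ndev; ++i) devs[i] = i;
  std::vector<ncclComm_t> comms(ndev);
  ncclCommInitAll(comms.data(), ndev, devs.data());

  const size_t count = 1 << 20;  // one buffer of 1M floats per GPU
  std::vector<float*> bufs(ndev);
  std::vector<cudaStream_t> streams(ndev);
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&bufs[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // "Broadcast": distribute the parameter owned by rank 0 to every GPU.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclBcast(bufs[i], count, ncclFloat, /*root=*/0, comms[i], streams[i]);
  ncclGroupEnd();

  // "AllReduce": sum the gradients across all GPUs.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(bufs[i], bufs[i], count, ncclFloat, ncclSum, comms[i],
                  streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(bufs[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```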

### Transpiler

To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the transpiler compiles the user-defined operation graph into sub-graphs to be executed on different devices.

1. The user-defined model will be a single-device program.

2. Broadcast/Reduce operators between GPUs will be inserted into the program; in the multi-node case, `Send` and `Recv` operators may be inserted as well.

*Broadcast and AllReduce on a single machine; Broadcast, AllReduce, and [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) across multiple machines.*

<img src="images/multigpu_before_convert.png" width="300"/>

After compilation, the graph is as shown below:

<img src="images/multigpu_allreduce.png" width="1000"/>

Operators are added to the sub-graphs. Every GPU is assigned a role such as `rank0`, `rank1`, etc.

- **Broadcast**. The Broadcast operator distributes an initialized parameter to all the GPUs from the GPU that owns it, e.g. from the `rank0` GPU.
- **AllReduce**. The AllReduce operator synchronizes parameters/gradients between the GPUs. AllReduce is implemented with the ring-based communication method, which avoids the bottleneck of a single GPU.

Note that the AllReduce operator forces the GPUs to synchronize at that point. Whether the whole training process runs in asynchronous or synchronous mode depends on where the AllReduce appears in the graph.

As shown in the picture, each GPU computes the gradient of `W`; the following `AllReduce` operator accumulates `dW` over the full batch of data, and then each GPU runs the optimization process individually and applies the gradient to its own `W`.

- **AllReduce**
  Note that our AllReduce operator is a ring-based AllReduce implementation. If we used the NCCL2 AllReduce primitive directly, every GPU would optimize the full batch of data, wasting the compute resources of (n-1) GPUs. In addition, the NCCL2 built-in AllReduce only utilizes the communication resources during synchronization, and updating the gradients becomes a subsequent phase. In fact, we can amortize the gradient-update time cost into the communication phase. The process is:
  1. Every parameter has its root card, which is responsible for aggregating the gradients from the GPUs.
  2. The whole model's parameters are hashed to different root cards to ensure load balance between the GPUs.
  3. Logically neighboring cards start sending the parameter to the next one. After one round, the parameter's root card has aggregated the full gradients.
  4. The root card then optimizes the parameter.
  5. The root card sends its optimized result to its neighbor, and each neighbor forwards the parameter to the next one.
  6. The synchronization round finishes.

The total time cost is 2 * (n-1) * per-parameter send time, so we reach the goal of amortizing the update time into the communication phase.
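
As a small self-contained illustration of steps 1-2 and the step count above (plain C++, not Paddle code; the parameter names and GPU count are made up):

```cpp
// Toy illustration of hashing parameters to root cards (load balance) and of
// the 2 * (n - 1) sends per parameter in one synchronization round.
// Parameter names and the GPU count are made-up examples.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const int n_gpus = 4;
  const std::vector<std::string> params = {"conv1.w", "conv1.b", "fc1.w", "fc1.b"};

  std::hash<std::string> hasher;
  for (const auto& name : params) {
    // Step 2: every parameter is hashed to a root card.
    int root = static_cast<int>(hasher(name) % n_gpus);
    std::cout << name << " -> root card rank" << root << "\n";
  }

  // Steps 3-6: (n - 1) sends to aggregate gradients on the root card, then
  // (n - 1) sends to propagate the optimized parameter back around the ring.
  std::cout << "sends per parameter per round: " << 2 * (n_gpus - 1) << "\n";
  return 0;
}
```
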
@@ -0,0 +1,43 @@
# Design Doc: Execute the Program with Multi CPU

## Abstract

This design doc proposes an approach to run the user-defined Op graph
with multiple CPUs: an auto transpiler converts the user-defined
Op graph into a multi-CPU Op graph, and a `ParallelDo` Op runs the graph.

## Transpiler

<img src="src/multi-threads/single-thread@3x.png" width="300">

After conversion:

<img src="src/multi-threads/multi-threads@3x.png" width="1000">

## Implementation

- `Multi-CPU Transpiler` will convert the graph to a multi-CPU graph
  which would be executed with multi-threads.
- `BlockingCounter` will `Init/Decrement` an atomic counter, and `Wait` will block
  until the atomic counter becomes `0` (see the sketch after this list):
  ```cpp
  BlockingCounter bc(thread_count);
  for (int i = 0; i < thread_count; ++i) {
    thread_pool->Start([&bc] { bc.DecrementCount(); });
  }
  bc.Wait();
  ```
- `ParallelDo` Operator
  - Initialize a thread pool, which is a singleton.
  - Take a block id as the input, and create and run the specified Block in an
    independent scope on each of the threads.
  - Initialize a `BlockingCounter` instance and wait until all threads are done.
- `Split` Operator will split the input Tensor into a TensorArray.
- `Merge` Operator will merge all the gradients calculated by the different threads
  with a `mean/sum/max/min...` method, and then run the Optimizer Op to optimize `W`.
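
A minimal sketch of what `BlockingCounter` could look like, assuming standard C++11 synchronization primitives (the real implementation may differ):

```cpp
// Minimal BlockingCounter sketch using C++11 primitives; the actual
// implementation in the framework may differ.
#include <condition_variable>
#include <mutex>

class BlockingCounter {
 public:
  explicit BlockingCounter(int count) : count_(count) {}

  // Decrement the counter and wake up all waiters once it reaches 0.
  void DecrementCount() {
    std::lock_guard<std::mutex> lock(mu_);
    if (--count_ == 0) cv_.notify_all();
  }

  // Block until the counter reaches 0.
  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return count_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int count_;
};
```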

## TODO

- Improve the optimizer stage with multi-threads: we could
  assign the parameters to different threads and execute
  the optimizer with multi-threads.