# SLIM Quantization-aware training (QAT) for INT8 MKL-DNN
This document describes how to use Paddle Slim to convert a quantization-aware trained model (Quant model) into INT8 MKL-DNN quantized model and run it.
In Release 1.5, we have released the first approach to the MKL-DNN-based quantization of Quant models, called Quant1. It enabled the `conv2d` and `mul` INT8 MKL-DNN kernels for Quant trained models (GoogleNet, MobileNetV1, MobileNetV2, ResNet50, ResNet101, VGG16, and VGG19) with 0.05% accuracy diff.
In Release 1.6, a new approach was introduced, called Quant2, which adds support for more performance optimizations and more INT8 MKL-DNN kernels. INT8 MKL-DNN models obtained using Quant2 have much better inference performance than those obtained using Quant1, at the cost of only a slightly bigger accuracy diff.
In Release 1.7, support for an Ernie (NLP) Quant trained model was added to Quant2.
In Release 2.0, further optimizations were added to Quant2: an INT8 `matmul` kernel, in-place execution of activation and `elementwise_add` operators, and broader support for the quantization-aware strategy from PaddleSlim.
In this document we focus on the Quant2 approach only.
## 0. Prerequisites
- PaddlePaddle in version 2.0 or higher is required. For instructions on how to install it see the installation document.
- MKL-DNN and MKL are required. The highest performance gain can be observed using CPU servers supporting AVX512 instructions.
- INT8 accuracy is best on CPU servers supporting the AVX512 VNNI extension (e.g. CLX class Intel processors). A Linux server supports AVX512 VNNI instructions if the output of the command `lscpu` contains the `avx512_vnni` entry in the `Flags` section. AVX512 VNNI support on Windows can be checked using the `coreinfo` tool.
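For a quick programmatic check of the AVX512 VNNI requirement on Linux, the sketch below scans the CPU flags in `/proc/cpuinfo`; it is only an illustrative helper (not part of PaddlePaddle) and is equivalent to looking for the `avx512_vnni` entry in the output of `lscpu`.

```python
# Illustrative sketch (not part of PaddlePaddle): check for AVX512 VNNI support
# on a Linux server by scanning the CPU flags, equivalent to looking for the
# avx512_vnni entry in the Flags section of `lscpu`.
def has_avx512_vnni(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        return any("avx512_vnni" in line for line in f if line.startswith("flags"))

print("AVX512 VNNI supported:", has_avx512_vnni())
```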
## 1. Introduction
There are two approaches to quantization supported in PaddlePaddle: post-training quantization (PTQ) and quantization-aware training (QAT). With either approach, a user can convert models created by PaddleSlim into INT8 models and run INT8 inference on CPU. PTQ is more automatic and requires less model preparation. However, QAT usually gives better accuracy with similar performance. In this document we focus on a transformation from intermediate models obtained during the QAT process (Quant models) into MKL-DNN INT8 models. We call this procedure Quant2.
## 2. How to turn an FP32 model into a Quant model?
A procedure on how to transform an FP32 model into a Quant model supported by the Quant2 approach is described in this document.
## 3. How to turn a Quant model into an INT8 MKL-DNN model?
A Quant model can be transformed into an INT8 quantized model if it contains enough information about quantization scales for every quantized operator in the graph. The process of quantization is done by the `Quant2Int8MkldnnPass` pass which comprises several steps:
### Gathering scales
The information about the quantization scales is collected from two sources:
- the `out_threshold` attribute of quantizable operators - it contains a single value quantization scale for the operator's output,
- fake quantize/dequantize operators - they imitate quantization from FP32 into INT8, or dequantization in the reverse direction, but keep the quantized tensor values as floats.
There are three types of fake quantize/dequantize operators:
- `fake_quantize_moving_average_abs_max` and `fake_quantize_range_abs_max` - used before a quantized operator (e.g. `conv2d`), gather a single value scale for the op's input,
- `fake_dequantize_max_abs` and `fake_channel_wise_dequantize_max_abs` - used after quantized operators, contain scales used for dequantization of the operators' weights; the first one collects a single value scale for the weights tensor, whereas the second one collects a vector of scales for each output channel of the weights,
- `fake_quantize_dequantize_moving_average_abs_max` - used after a quantized operator to get the scale value for the op's output; imitates immediate quantization and dequantization.
Scale values gathered from the fake quantize/dequantize operators take precedence over the scales collected from the `out_threshold` attributes.
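As a conceptual sketch of this step (not the actual `Quant2Int8MkldnnPass` code), scale gathering can be thought of as filling a per-tensor scale dictionary from the `out_threshold` attributes first and then overwriting entries with scales taken from the fake quantize/dequantize operators, so that the latter take precedence; the operator records and their fields below are hypothetical.

```python
# Conceptual sketch of scale gathering (not the actual Quant2Int8MkldnnPass code).
# Each op is a hypothetical record with fields: type, output_name, out_threshold, scale.
def gather_scales(ops):
    scales = {}  # maps tensor names to quantization scales
    # 1) single-value scales from the out_threshold attribute of quantizable operators
    for op in ops:
        if getattr(op, "out_threshold", None) is not None:
            scales[op.output_name] = op.out_threshold
    # 2) scales from fake quantize/dequantize operators overwrite them (take precedence)
    for op in ops:
        if op.type.startswith(("fake_quantize", "fake_dequantize")):
            scales[op.output_name] = op.scale
    return scales
```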
Notes:
- As the next steps describe, quantization will be applied later to an optimized FP32 model. It means that quantization scales for inputs and outputs of each quantized operator have to be gathered for tensors which are inputs and outputs of already optimized or fused operators. For example, if a model contains the following sequence of tensors and operators in the graph

  `... → input1 → conv2d → output1 → batch_norm → output2 → relu → output3 → ...`

  and we want to quantize the `conv2d` op, then after applying FP32 optimizations the sequence will become

  `... → input1 → conv2d → output3 → ...`

  and the quantization scales have to be collected for the `input1` and `output3` tensors in the Quant model.
- Quantization of the following operators is supported: `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `concat`.
- The longer the sequence of consecutive quantizable operators in the model, the bigger the performance boost that can be achieved through quantization:

  `... → conv2d → conv2d → pool2d → conv2d → conv2d → ...`

  Quantizing a single operator separated from other quantizable operators can give no performance benefits or even slow down the inference:

  `... → swish → fc → softmax → ...`
### Removing fake operators
All the `fake_quantize_*` and `fake_dequantize_*` operators are removed from the graph.
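A conceptual sketch of this removal step, written against a hypothetical list-of-ops graph representation (not the real IrGraph API): each fake operator is bypassed by reconnecting its consumers to its data input and then dropping the operator.

```python
# Conceptual sketch of fake operator removal, using a hypothetical graph
# representation (a list of op records), not the real IrGraph API.
# Each op record has: type, inputs (list of tensor names), output (tensor name).
def remove_fake_ops(ops):
    kept, bypass = [], {}
    for op in ops:
        if op.type.startswith(("fake_quantize", "fake_dequantize")):
            # Bypass the fake op: its output tensor becomes an alias of its data input.
            bypass[op.output] = op.inputs[0]
        else:
            kept.append(op)

    def resolve(name):
        # Follow chains such as fake_quantize -> fake_dequantize.
        while name in bypass:
            name = bypass[name]
        return name

    # Reconnect the remaining operators directly to the original tensors.
    for op in kept:
        op.inputs = [resolve(name) for name in op.inputs]
    return kept
```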
### Dequantizing weights
Weights of `conv2d`, `depthwise_conv2d` and `mul` operators are assumed to be fake-quantized (with integer values in the `int8` range, but kept as `float`s) in Quant models. Here, the information about the scale from `fake_dequantize_max_abs` and `fake_channel_wise_dequantize_max_abs` operators is used to fake-dequantize the weights back to the full float range of values. At this point the model becomes an unoptimized clean FP32 inference model.
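A minimal numpy sketch of the channel-wise variant of this fake-dequantization, assuming 8-bit quantization (so the integer range maximum is 127) and one scale per output channel (axis 0 of the weight tensor); it illustrates the idea rather than the exact pass implementation.

```python
import numpy as np

# Minimal sketch of channel-wise weight fake-dequantization, assuming 8-bit
# quantization (max_range = 127) and one scale per output channel (axis 0),
# matching the description of fake_channel_wise_dequantize_max_abs above.
def dequantize_weights(weights, scales, max_range=127.0):
    # weights: float array holding fake-quantized integer values, e.g. shape (O, I, H, W)
    # scales:  per-output-channel scales, shape (O,)
    scales = np.asarray(scales, dtype=weights.dtype).reshape((-1,) + (1,) * (weights.ndim - 1))
    return weights * scales / max_range
```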
### Optimizing FP32 graph
A series of standard optimization passes is applied to the FP32 graph. This gives us an optimized FP32 inference model and we can proceed with INT8 quantization.
### Computing weight scales
After the optimization fuses are applied, the weight tensors of `conv2d` or `fc` operators are likely to have different values and require new quantization scales. The weights are static, i.e. they do not change during the inference process, so the scales can be calculated simply as a maximum of absolute values from the tensor. To improve the inference accuracy we calculate the scales for each output channel separately, getting an array of quantization scales for a weight tensor.
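A short numpy sketch of the per-output-channel abs-max computation described above; output channels are assumed to be axis 0 of the weight tensor, and the actual pass may additionally map these ranges onto the INT8 range.

```python
import numpy as np

# Per-output-channel abs-max values for a weight tensor, e.g. shape (O, I, H, W).
# Output channels are assumed to be axis 0.
def compute_weight_scales(weights):
    flat = np.abs(weights).reshape(weights.shape[0], -1)
    return flat.max(axis=1)  # one scale per output channel
```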
### Taking activations into account
The basic datatype used during INT8 inference is signed INT8, with possible values from -128 to 127. However, if a `conv2d` or `fc` operator has a `relu` or `relu6` activation integrated in it, the output of the operator is known to have non-negative values. In that case we use the unsigned INT8 datatype for output tensors, with a wider range for positive values (0 to 255), improving the inference accuracy further.
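The datatype choice can be summarized with a tiny hypothetical helper (not part of the pass API):

```python
# Hypothetical helper illustrating the output datatype choice described above
# (not part of the pass API).
def output_dtype(fused_activation):
    # relu/relu6 guarantee non-negative outputs, so unsigned INT8 (0..255) can be used;
    # otherwise the output stays in signed INT8 (-128..127).
    return "uint8" if fused_activation in ("relu", "relu6") else "int8"
```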
### Propagation of scales
Some of the operators (e.g. `reshape2`, `transpose2`, `pool2d` with max pooling) transform the data without changing the quantization scale. For this reason we propagate the quantization scale values through these operators without any modifications. We also propagate the quantization scales through the `scale` operator, updating the quantization scale accordingly. This approach lets us minimize the number of fake quantize/dequantize operators in the graph, because the information about the scales required for the quantization process to succeed spreads between quantized operators.
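A conceptual sketch of the propagation step, assuming the value stored per tensor is its max-abs range (so a `scale` operator with factor `a` multiplies that range by `|a|`) and that operators are visited in topological order; the operator records and their fields are hypothetical.

```python
# Conceptual sketch of scale propagation over hypothetical op records
# (fields: type, inputs, output, scale_factor), not the real pass API.
SCALE_PRESERVING_OPS = {"reshape2", "transpose2", "pool2d"}  # pool2d: max pooling only

def propagate_scales(ops, scales):
    for op in ops:  # ops assumed to be in topological order
        if op.inputs[0] not in scales:
            continue
        if op.type in SCALE_PRESERVING_OPS:
            # Data is only rearranged or max-pooled, so its value range is unchanged.
            scales.setdefault(op.output, scales[op.inputs[0]])
        elif op.type == "scale":
            # output = input * scale_factor, so the max-abs range is rescaled accordingly.
            scales.setdefault(op.output, scales[op.inputs[0]] * abs(op.scale_factor))
    return scales
```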
### Applying quantization passes
Having gathered all the data needed for quantization, we apply the `cpu_quantize_pass`, which quantizes the graph, and the `cpu_quantize_squash_pass`, which optimizes the INT8 graph.
## 4. Code example
The code snippet below shows how the `Quant2Int8MkldnnPass` can be applied to a model graph:
```python
import paddle.fluid as fluid
from paddle.fluid.contrib.slim.quantization import Quant2Int8MkldnnPass
from paddle.fluid.framework import IrGraph
from paddle.fluid import core

# Create the IrGraph from a Program
graph = IrGraph(core.Graph(fluid.Program().desc), for_test=False)
place = fluid.CPUPlace()
# Convert the IrGraph to an MKL-DNN supported INT8 IrGraph using the
# Quant2Int8MkldnnPass. It requires a set of operators to be quantized.
mkldnn_pass = Quant2Int8MkldnnPass({'conv2d', 'pool2d'}, fluid.global_scope(), place, fluid.core, False)
# Apply Quant2Int8MkldnnPass to the IrGraph
mkldnn_pass.apply(graph)
```
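If the resulting INT8 graph needs to be stored for later inference, a minimal follow-up (assuming the `IrGraph.to_program()` method of `paddle.fluid.framework.IrGraph` is available in your Paddle version) is to convert the graph back into a `Program`:

```python
# Convert the quantized IrGraph back to a Program so it can be saved or executed
# (assumes the IrGraph.to_program() method of paddle.fluid.framework.IrGraph).
inference_program = graph.to_program()
```

Alternatively, the `save_quant_model.py` script described in section 6 performs the transformation and saving in one step.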
## 5. Accuracy and Performance benchmark

This section contains Quant2 MKL-DNN accuracy and performance benchmark results measured on the following server:

- Intel(R) Xeon(R) Gold 6271 (with AVX512 VNNI support)
Performance benchmarks were run with the following environment settings:
- The benchmark threads were assigned to cores by setting:

  ```bash
  export KMP_AFFINITY=granularity=fine,compact,1,0
  export KMP_BLOCKTIME=1
  ```

- Turbo Boost was set to OFF using the command:

  ```bash
  echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  ```
### Image classification models benchmark results

#### Accuracy

**Intel(R) Xeon(R) Gold 6271**
| Model | FP32 Top1 Accuracy | INT8 Quant Top1 Accuracy | Top1 Diff | FP32 Top5 Accuracy | INT8 Quant Top5 Accuracy | Top5 Diff |
| :---: | :----------------: | :----------------------: | :-------: | :----------------: | :----------------------: | :-------: |
| MobileNet-V1 | 70.78% | 70.71% | -0.07% | 89.69% | 89.41% | -0.28% |
| MobileNet-V2 | 71.90% | 72.11% | +0.21% | 90.56% | 90.62% | +0.06% |
| ResNet101 | 77.50% | 77.64% | +0.14% | 93.58% | 93.58% | 0.00% |
| ResNet50 | 76.63% | 76.47% | -0.16% | 93.10% | 92.98% | -0.12% |
| VGG16 | 72.08% | 71.73% | -0.35% | 90.63% | 89.71% | -0.92% |
| VGG19 | 72.57% | 72.12% | -0.45% | 90.84% | 90.15% | -0.69% |
#### Performance
Image classification models performance was measured using a single thread. The setting is included in the benchmark reproduction commands below.
**Intel(R) Xeon(R) Gold 6271**
| Model | FP32 (images/s) | INT8 Quant (images/s) | Ratio (INT8/FP32) |
| :---: | :-------------: | :-------------------: | :---------------: |
| MobileNet-V1 | 74.05 | 196.98 | 2.66 |
| MobileNet-V2 | 88.60 | 187.67 | 2.12 |
| ResNet101 | 7.20 | 26.43 | 3.67 |
| ResNet50 | 13.23 | 47.44 | 3.59 |
| VGG16 | 3.47 | 10.20 | 2.94 |
| VGG19 | 2.83 | 8.67 | 3.06 |
Notes:
- Performance FP32 (images/s) values come from the INT8 MKL-DNN post-training quantization document.
### NLP models benchmark results

#### Accuracy

**Intel(R) Xeon(R) Gold 6271**
| Model | FP32 Accuracy | Quant INT8 Accuracy | Accuracy Diff |
| :---: | :-----------: | :-----------------: | :-----------: |
| Ernie | 80.20% | 79.44% | -0.76% |
#### Performance

**Intel(R) Xeon(R) Gold 6271**
| Model | Threads | FP32 Latency (ms) | Quant INT8 Latency (ms) | Ratio (FP32/INT8) |
| :---: | :-----: | :---------------: | :---------------------: | :---------------: |
| Ernie | 1 thread | 237.21 | 79.26 | 2.99x |
| Ernie | 20 threads | 22.08 | 12.57 | 1.76x |
## 6. How to reproduce the results
The steps below show, taking ResNet50 as an example, how to reproduce the above accuracy and performance results for Image Classification models. To reproduce NLP models results (Ernie), please follow How to reproduce Ernie Quant results on MKL-DNN.
### Prepare dataset
Download the dataset for image classification models benchmarking by executing:
```bash
cd /PATH/TO/PADDLE
python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
```

The converted data binary file is saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`.
### Prepare models

Run the following commands to download and extract a Quant model:
```bash
mkdir -p /PATH/TO/DOWNLOAD/MODEL/
cd /PATH/TO/DOWNLOAD/MODEL/
export QUANT_MODEL_NAME=resnet50
export QUANT_MODEL_ARCHIVE=${QUANT_MODEL_NAME}_quant.tar.gz
wget http://paddle-inference-dist.bj.bcebos.com/int8/QAT2_models/${QUANT_MODEL_ARCHIVE}
mkdir ${QUANT_MODEL_NAME} && tar -xvf ${QUANT_MODEL_ARCHIVE} -C ${QUANT_MODEL_NAME}
```
To download other Quant models, set the `QUANT_MODEL_NAME` variable in the above commands to one of the values: `resnet101`, `mobilenetv1`, `mobilenetv2`, `vgg16`, `vgg19`.
Download a clean FP32 model for accuracy comparison against the INT8 model:
```bash
cd /PATH/TO/DOWNLOAD/MODEL/
export FP32_MODEL_NAME=resnet50
export FP32_MODEL_ARCHIVE=${FP32_MODEL_NAME}_int8_model.tar.gz
wget http://paddle-inference-dist.bj.bcebos.com/int8/${FP32_MODEL_ARCHIVE}
mkdir ${FP32_MODEL_NAME} && tar -xzvf ${FP32_MODEL_ARCHIVE} -C ${FP32_MODEL_NAME}
```
To download other FP32 models, set the `FP32_MODEL_NAME` variable to one of the values: `Res101`, `mobilenetv1`, `mobilenet_v2`, `VGG16`, and `VGG19`.
### Run benchmark

#### Accuracy benchmark commands
You can use the `quant2_int8_image_classification_comparison.py` script to reproduce the accuracy result of the INT8 Quant models. The following options are required:
- `--quant_model` - a path to a Quant model that will be transformed into an INT8 model.
- `--fp32_model` - a path to an FP32 model whose accuracy will be measured and compared to the accuracy of the INT8 model.
- `--infer_data` - a path to the validation dataset.
The following options are also accepted:
- `--ops_to_quantize` - a comma-separated list of operator types to quantize. If the option is not used, an attempt to quantize all quantizable operators will be made, and in that case only quantizable operators which have quantization scales provided in the Quant model will be quantized. When deciding which operators to put on the list, the following have to be considered:
  - Only operators which support quantization will be taken into account.
  - All the quantizable operators from the list, which are present in the model, must have quantization scales provided in the model. Otherwise, quantization of the operator will be skipped with a message saying which variable is missing a quantization scale.
  - Sometimes it may be suboptimal to quantize all quantizable operators in the model (cf. Notes in the Gathering scales section above). To find the optimal configuration for this option, the user can run the benchmark a few times with different lists of quantized operators present in the model and compare the results. For the Image Classification models mentioned above the list usually comprises `conv2d` and `pool2d` operators.
- `--op_ids_to_skip` - a comma-separated list of operator ids to skip in quantization. To get an id of a particular operator run the script with the `--debug` option first (see below for the description of the option), and having opened the generated file `int8_<some_number>_cpu_quantize_placement_pass.dot` find the id number written in parentheses next to the name of the operator.
- `--debug` - add this option to generate a series of `*.dot` files containing the model graphs after each step of the transformation. For a description of the DOT format see DOT. The files will be saved in the current location. To open the `*.dot` files use any of the Graphviz tools available on your system (e.g. the `xdot` tool on Linux or the `dot` tool on Windows, for documentation see Graphviz).
```bash
cd /PATH/TO/PADDLE
OMP_NUM_THREADS=28 FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/slim/tests/quant2_int8_image_classification_comparison.py --quant_model=/PATH/TO/DOWNLOADED/QUANT/MODEL --fp32_model=/PATH/TO/DOWNLOADED/FP32/MODEL --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=50 --batch_num=1000 --acc_diff_threshold=0.01 --ops_to_quantize="conv2d,pool2d"
```
Notes: Due to the large number of images in the `int8_full_val.bin` dataset (50 000), the accuracy benchmark may take a long time. To accelerate accuracy measuring, it is recommended to set `OMP_NUM_THREADS` to the maximum number of physical cores available on the server.
#### Performance benchmark commands
To reproduce the performance results, the environment variable `OMP_NUM_THREADS=1` and the `--batch_size=1` option should be set.
- Transform the Quant model into an INT8 model by applying the `Quant2Int8MkldnnPass` pass and save the result. You can use the script `save_quant_model.py` for this purpose. It also accepts the option `--ops_to_quantize` with a list of operators to quantize.

  ```bash
  cd /PATH/TO/PADDLE/build
  python ../python/paddle/fluid/contrib/slim/tests/save_quant_model.py --quant_model_path=/PATH/TO/DOWNLOADED/QUANT/MODEL --int8_model_save_path=/PATH/TO/SAVE/QUANT/INT8/MODEL --ops_to_quantize="conv2d,pool2d"
  ```

- Run the C-API test for performance benchmark.

  ```bash
  cd /PATH/TO/PADDLE/build
  OMP_NUM_THREADS=1 paddle/fluid/inference/tests/api/test_analyzer_quant_image_classification ARGS --enable_fp32=false --with_accuracy_layer=false --int8_model=/PATH/TO/SAVED/QUANT/INT8/MODEL --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
  ```