* Fix Mask R-CNN predictor
1. refine the memory optimization algorithm to support models with the block op.
2. fix an output diff by modifying the affine channel fuse
3. add condition_block_infer op
add an interface for setting the TRT calibration table directory
test=develop
* add the missing files.
test=develop
* rename mkldnn set/get_cur_thread_id() to set/get_cur_mkldnn_session_id()
test=develop
* update session id definition and adjust logic for default behavior
test=develop
* reset the logic in mkldnn reuse, as most cases work with the default.
test=develop
1. Since the allreduce op has 4 reduce types, we split it into four ops, one per reduce type (see the sketch below).
2. We also refined the collective op code, e.g. we separated the collective op kernel into a CPUKernel and a CUDAKernel, and removed the device-specific DeviceContext template parameter, since the target DeviceContext is already known.
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis.
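A minimal sketch of what the split looks like from the framework side; the per-type op names (c_allreduce_sum and friends) and the helper are illustrative assumptions, not the exact code:

```python
# Sketch only: one op per reduce type instead of a single allreduce op
# that selected the type via an attribute such as
# attrs={'reduce_type': reduce_type}.
def append_allreduce(block, var, reduce_type):
    op_type = {
        'sum': 'c_allreduce_sum',
        'prod': 'c_allreduce_prod',
        'max': 'c_allreduce_max',
        'min': 'c_allreduce_min',
    }[reduce_type]  # after the split: four separate ops
    block.append_op(type=op_type,
                    inputs={'X': [var]},
                    outputs={'Out': [var]})
```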
* Fix a bug in the quantize kernel which caused a crash with the vgg16/19 models
test=develop
* refine the code to reduce verbose code; test=develop
* remove useless code; test=develop
1. some key generation methods were not aligned with PR #17965
2. extend the ptr lifetime to avoid releasing memory if SetBlob fails;
otherwise it core dumps.
test=develop
* fix the redundant code problem in prepare context; optimize the executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with Variables; add more unit tests for use_program_cache
test=develop
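A minimal sketch of the two features above, assuming the fluid 1.x executor API; the tiny network is made up for illustration:

```python
import numpy as np
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[4], dtype='float32')
y = fluid.layers.fc(input=x, size=2)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

for _ in range(3):
    # fetch_list accepts the Variable itself, not only its name, and
    # use_program_cache reuses the prepared context across identical runs.
    out, = exe.run(fluid.default_main_program(),
                   feed={'x': np.random.rand(1, 4).astype('float32')},
                   fetch_list=[y],
                   use_program_cache=True)
```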
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine the strategy-building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update to support Python 3.5
* test=develop change GPU memory usage to 0.1 in tests
* test=develop
update the unit test's equality check
* test=develop
set flags to 1.5
* test=develop fix pickle dump on py35
* test=develop
fix division in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and its test
read input from memory instead of from a file
remove origin_program in framework and add I/O in c_sync_calc_stream
* test=develop update unittest sync operator I/O
1. fix the bug that out_put_var in SaveSelectedRows would be an empty string
2. use merge_sparse_lookup_table to replace the sum op for load_persistables_for_inference
3. fix the bug in _clone_var_in_block_ when the var is SELECTED_ROWS.
* test=develop
fix type error of std::pow in sigmoid_focal_loss_op.cu and sigmoid_focal_loss_op.h
* test=develop
fix wrong code style in sigmoid_focal_loss_op.cu and sigmoid_focal_loss_op.h
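For context, a minimal NumPy sketch of the sigmoid focal loss these files implement, in its standard RetinaNet form (binary labels; the alpha and gamma defaults are just illustrative):

```python
import numpy as np

def sigmoid_focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    # FL(p) = -alpha * (1 - p)^gamma * log(p) for positives, plus the
    # mirrored term for negatives; std::pow maps to ** here.
    p = 1.0 / (1.0 + np.exp(-logits))
    pos = -labels * alpha * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - labels) * (1.0 - alpha) * p ** gamma * np.log(1.0 - p)
    return pos + neg
```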
* Update backward.py:
- If none of this op's input grad vars appear in the outputs of previous ops, do not append this op into the graph.
- Only apply this strategy for double backward.
* Update some double backward ops.
* Update sum_op to judge whether a tensor is empty by numel() or IsInitialized().
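A minimal sketch of how double backward is driven from Python, assuming the fluid.gradients interface; the toy program is illustrative only:

```python
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[4], dtype='float32')
x.stop_gradient = False
y = fluid.layers.elementwise_mul(x, x)

# First backward builds dy/dx; calling gradients again on that result
# appends the double backward (grad-of-grad) ops to the program.
dx = fluid.gradients([y], [x])
ddx = fluid.gradients(dx, [x])
```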
* test=develop add target assign for retinanet
* test=develop
run ci
* test=develop
add test_layers
* test=develop
add API.spec
* test=develop
alterations, round 1
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter test_rpn_target_assign_op.py
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter API.spec
* test=develop
alter paddle/fluid/operators/detection/rpn_target_assign_op.cc
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter python/paddle/fluid/layers/detection.py
* test=develop
alter paddle/fluid/API.spec
* refactor the function ConvFwdPrimitiveDesc
test=develop
* change according to review
test=develop
* use a raw pointer instead of boost::optional
test=develop
* pass the vector to the function by reference instead of by value
test=develop
* change pointer to shared_ptr
test=develop
* test=develop
The scatter op had a calculation bug when the indices contained duplicates: it used overwrite mode, so later updates to the same index clobbered earlier ones. Fix this by using accumulate mode for duplicated indices (see the sketch below). The gather op had the same bug when computing its grad. We also use OpenBLAS and Eigen to optimize the time cost of accumulate mode.
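A small NumPy sketch of the difference between the two modes on duplicated indices:

```python
import numpy as np

src = np.zeros(4)
index = np.array([1, 1, 2])            # index 1 is duplicated
updates = np.array([10., 20., 30.])

overwrite = src.copy()
overwrite[index] = updates             # overwrite mode: the 10 is lost
assert overwrite[1] == 20.

accumulate = src.copy()
np.add.at(accumulate, index, updates)  # accumulate mode: duplicates summed
assert accumulate[1] == 30.            # 10 + 20
```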
* test=develop
Fix some code format problems and, at the same time, add test cases for the gather and scatter ops.
Fix a bug in the sequence_unpad op: the allocated output memory did not match the actual memory, so the memory check failed. Fix it by allocating the output memory at the correct code position.
* add deformable psroi pooling
* test=develop
* test=develop
* test=develop
modify format
* fix bug
* test=develop run ci
* test=develop
add API.spec
* add test_layers.py
* run ci again
* test=develop
run ci again
* run ci again
* test=develop
run ci again
* test=develop
run ci again
* test=develop
run ci again
* add space between two lines
* test=develop
add space between two lines
* test=develop
add space between lines
* test=develop
modify comment in nn.py
* test=develop
add space between two lines
* test=develop
add space between two lines
* update API.spec
* run ci again
* test=develop
run ci again
* rerun ci
* test=develop
rerun ci
* change input shape
* run ci
* test=develop
run ci
* modify format of nn.py
* test=develop
* test=develop
* test=develop
update API.spec
* test=develop
fix API doc
* modify API comment
* modify API comment
* test=develop
update API.spec
* test=develop
modify comment
* test=develop
modify comment
* test=develop
modify comment
* test=develop
update API.spec
* test=develop
modify comment
* test=develop
add inference in nn.py
* test=develop
update API.spec
* test=develop
resolve conflict
* test=develop
update API.spec
* add unfold op
test=develop
* fix the divide bug in Python 3 when calculating output width and height (see the sketch below)
test=develop
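A minimal illustration of the pitfall, assuming the standard sliding-window output-size formula (the names here are illustrative):

```python
# In Python 3, `/` is true division and yields a float, which breaks the
# integer shape computation; `//` restores Python 2's floor division.
def out_size(size, kernel, stride, padding, dilation):
    effective_k = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_k) // stride + 1  # not `/`

assert out_size(size=7, kernel=3, stride=2, padding=1, dilation=1) == 4
```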
* add name=None in python api, move redundant code into inline function
* try to trigger ci for this code
test=develop
* Enable seq_pool op to accept len 0 input
test=develop
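A minimal sketch of feeding a zero-length sequence to sequence_pool under the fluid 1.x API; the concrete LoD layout is an illustrative assumption:

```python
import numpy as np
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1)
pooled = fluid.layers.sequence_pool(input=x, pool_type='sum')

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

data = np.array([[1.], [2.], [3.]], dtype='float32')
# Three sequences of lengths 2, 0 and 1 -- the middle one is empty.
tensor = fluid.create_lod_tensor(data, [[2, 0, 1]], fluid.CPUPlace())
out, = exe.run(feed={'x': tensor}, fetch_list=[pooled])
```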
* Update sequence_pool's api
test=develop
* Add more unittest cases for seq_pool op
test=develop
* Remove legacy comments
test=develop
* Don't use template in op maker
test=develop
* fix the MobileNet-SSD INT8 inference bug without overloading GetHash
test=develop
* remove the out_grad->format() in TransposeMKLDNNGradOpKernel
test=develop
* Enhance fused_elementwise_activation op.
test=develop
* Move the api fused_elementwise_activation to contrib.
test=develop
* Add including files.
test=develop
* Add the support of sigmoid in fused_elementwise_activetion op.
* Update API.spec.
test=develop
* Optimize the concat and split kernels for the special case where the number of inputs/outputs is 2.
test=develop
* Refine codes.
test=develop
* Correct the condition.
test=develop
* Move the define of tmp_data outside the if statement.
* Print the cudnn minor version.
test=develop
* Fix the case when in_num/o_num is 1 in concat/split op.
test=develop
* Remove const_cast.
test=develop
* add INT8 conv+relu6 fuse and enable MobileNet-v2 INT8 test
test=develop
* change false and 0.0 to fuse_brelu and brelu_threshold
test=develop
change the "fuse_relu||fuse_brelu" to "unsigned_output"
test=develop
* Use relu instead of brelu as INT8 post-op because INT8 brelu is not enabled in mkldnn v0.18
test=develop
* continuous-integration fix
test=develop
* add Concat quantization
add unit test for quantizing concat
fix a wrong value when the input is not in the map of calculated scales
add use_quantizer to concat_op.cc
add scale_algo rules for concat
test=develop
* add the missing fix for the multiple-inputs quantize-squash
* wojtuss review fix: adding comment
test=develop
* 1. align fluid INT8 training and TRT INT8 prediction.
initialize TRT INT8 prediction
op converter
* 2. align fluid INT8 training and TRT INT8 inference.
enhance the quant-dequant fuse pass
enhance the op converter, TRT engine, TRT engine op, and TRT subgraph pass.
* 3. add delete_quant_dequant_pass for trt
test=develop
* 4. add the missing file
test=develop
* 5. I modified the C++ interface but forgot to modify the pybind code;
fix the IS_TRT_VERSION_GE bug and fix the elementwise op converter
test=develop
* fuse mul and elementwise add to fc
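For reference, a tiny NumPy sketch of the pattern this fuse targets: a mul op followed by an elementwise_add of the bias computes exactly what a single fc op does:

```python
import numpy as np

x = np.random.rand(8, 16).astype('float32')   # input
w = np.random.rand(16, 4).astype('float32')   # weights
b = np.random.rand(4).astype('float32')       # bias

def mul_op(x, w):                 # mul op: matrix multiply
    return np.dot(x, w)

def elementwise_add_op(t, b):     # elementwise_add op: broadcast bias add
    return t + b

def fc_op(x, w, b):               # fused fc op
    return np.dot(x, w) + b

assert np.allclose(elementwise_add_op(mul_op(x, w), b), fc_op(x, w, b))
```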
* Reimplement the FC forward operator
* Fix FC MKLDNN integration by transposing weights
* Add FC MKLDNN Pass
test=develop
* FC MKLDNN Pass: change memcpy to std::copy
* Fix MKLDNN FC handling of mismatched input and weights dims
* Lower tolerance for MKL-DNN in resnet50 test
test=develop
* Adjust FC to support MKLDNN Op placement
test=develop
* Adjust Placement Op to set use_mkldnn attribute for graph
test=develop
* MKLDNN FC: fix weights format so that gemm version is called
test=develop
* FC MKLDNN: Remove tolerance decrease from tester_helper
* FC MKL-DNN: Refactor the code, change input reorder to weight reorder
* MKL-DNN FC: Introduce operator caching
test=develop
* FC MKL-DNN: Fix the tensor type in ExpectedKernelType
test=develop
* FC MKL-DNN: fix style changes
test=develop
* FC MKL-DNN: fall back to native on unsupported dim sizes
test=develop
* FC MKLDNN: fix CMake paths
test=develop
* FC MKLDNN: Refine placement pass graph mkldnn attribute
test=develop
* Fix Transpiler error for fuse_conv_eltwise
test=develop
* Fix missing STL includes in files
test=develop
* FC MKL-DNN: Enable new output size computation
Also, refine the pass to comply with the newest interface.
test=develop
* FC MKL-DNN: enable only when fc_mkldnn_pass is enabled
* FC MKL-DNN: Allow Weights to use oi or io format
* FC MKL-DNN: Adjust UT to work with correct dims
test=develop
* Enable MKL DEBUG for resnet50 analyzer
test=develop
* FC MKL-DNN: Improve Hashing function
test=develop
* FC MKL-DNN: Fix shape for fc weights in transpiler
* FC MKL-DNN: Update input pointer in re-used fc primitive
* Add log for not handling fc fuse for unsupported dims
test=develop
* FC MKL-DNN: Move transpose from pass to Op Kernel
test=develop
* FC MKL-DNN: Disable transpose in unit test
test=develop
* FC MKL-DNN: Remove fc_mkldnn_pass from default list
* Correct Flag for fake data analyzer tests
test=develop
* FC MKL-DNN: Add comment about fc mkldnn pass disablement
test=develop
* FC MKL-DNN: Disable fc in int8 tests
test=develop
* fix quantize_squash_pass segfault when there is no tensor linked to the Bias input
test=develop
* add googlenet test
test=develop
* fix concat CreateKey not using input format
test=develop
* Relu6 is the bottleneck op for MobileNet-v2. Since MKL-DNN supports conv/relu6 fusion, we implement this fusion via a fuse pass. Because INT8 support for this fusion will only arrive in MKL-DNN v0.20, this PR focuses on the FP32 optimization; a sketch of the fused pattern follows the table.
The table below shows the benchmark (FPS) measured on SKX-8180 (28 cores):
Batch size | with fusion | without fusion
-- | -- | --
1 | 214.7 | 53.4
50 | 1219.727 | 137.280
test=develop
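A minimal sketch of the conv + relu6 pattern the fuse pass rewrites, written against the fluid layers API; the shapes and filter counts are illustrative:

```python
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[3, 224, 224], dtype='float32')
conv = fluid.layers.conv2d(input=x, num_filters=32, filter_size=3)
# relu6(v) = min(max(v, 0), 6); with MKL-DNN the pass folds it into the
# conv as a bounded-relu (brelu) post-op with brelu_threshold = 6.
out = fluid.layers.relu6(conv, threshold=6.0)
```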
* Fix the format issue
test=develop
* Add the missing nolint comments.
test=develop
* Fix the typos.
test=develop
* Register the conv_brelu_mkldnn_fuse_pass for the MKLDNN engine.
test=develop
* Adjust the indentation.
test=develop
* Add the test_conv_brelu_mkldnn_fuse_pass case.
test=develop
* Slightly update the code per Baidu comments.
Embed the parameter definitions into the code;
that makes the code easier to understand.
test=develop
* Optimize the elementwise op with CUDA kernels.
test=develop
* Support setting of attr in op config file.
test=develop
* Add support for setting dtype and initializer in the config.
test=develop
* Save workspace.
* Add initializer "zeros".
test=develop
* Fix compiling error.
* Support using an existing file to initialize a tensor in op_tester.
* Use Eigen to optimize elementwise_add/mul for the case where x and y have the same dims.
test=develop
* add double grad for elementwise_mul. test=develop
* remove comment. test=develop
* fix grad sum. test=develop
* fix for axis expand. test=develop
* add test for axis expand. test=develop
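For reference, the double-grad identities for z = x * y, checked numerically in a small NumPy sketch (standard calculus for a bilinear op, not the op's exact code; broadcast/axis-expand cases additionally need a reduce-sum over the expanded axes):

```python
import numpy as np

shape = (3, 4)
x, y, dout = (np.random.rand(*shape) for _ in range(3))
ddx, ddy = (np.random.rand(*shape) for _ in range(2))

# First backward of z = x * y:  dx = dout * y,  dy = dout * x.
# Double backward, given perturbations ddx and ddy of dx and dy:
ddout  = ddx * y + ddy * x   # z is bilinear in x and y
new_dx = ddy * dout          # from dy = dout * x
new_dy = ddx * dout          # from dx = dout * y
```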
* Add conv2d_grad_grad_op
* Extract the cuDNN conv algo searching code into conv_cudnn_helper.h.
- Now use it in conv2d_grad_grad.
- Will simplify the searching code in conv2d and conv2d_grad in the next PR.
* Enhance and fix a bug in the gradient_checker unit tests.
* Support fetching empty variables; return None in Python.
* Refine the elementwise kernel.
Add a simple CUDA kernel for when grad x and grad y both exist.
Use a 2D-block CUDA kernel to do broadcast.
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
* refine code.
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
* refine code.
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
* add use_cuda to the inplace pass, test=develop
* add softmax_with_xe_inplace test, test=develop
* fix potential inplace bug
test=develop
* add more skip vars in the mem opt pass, test=develop
* follow comment, test=develop
* follow comments; move the duplicate out-arg check to program->graph, test=develop
* Add MovingAverageAbsMaxScale operator which is only used for calculating the quantization scale.
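A small sketch of the moving-average abs-max recurrence this operator computes; the accum/state form follows Paddle's moving-average convention and should be treated as an assumption here:

```python
import numpy as np

def moving_average_abs_max_scale(x, accum, state, rate=0.9):
    # state and accum form an exponential moving average of max(|x|);
    # their ratio is the current quantization scale.
    state = rate * state + 1.0
    accum = rate * accum + np.max(np.abs(x))
    return accum / state, accum, state
```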
* test=develop
* change the output into inplace. test=develop
* Revert "test=develop"
This reverts commit 696cf62699ba1e1c98f61f7345ac7060010eb29a.
* Revert "change the output into inplace. test=develop"
This reverts commit a19acd20f07eee82622701a3015e6e9c073a5e0b.
* test=develop.
* update the MovingAverageAbsMaxScaleOp test. test=develop
* refine_dropout_mem,test=develop
* # This is a combination of 14 commits.
# The first commit's message is:
remove ut test_dist_word2vec in mac ci, will fix it in private, test=develop (#17066)
# This is the 2nd commit message:
Fleet unify distributed training (#16791)
* implement distributed transpiler with fleet
# This is the 3rd commit message:
ParallelDyGraph with GPU collective mode (#16827)
implement dygraph.parallel.DataParallel to hook reduce op.
# This is the 4th commit message:
Init mixed precision training interface (#16856)
* Init mixed precision training interface
* Add fp16 test script
test=develop
* All initializers support float16
test=develop
* Code cleanup & add more code annotations
test=develop
* Update API spec
test=develop
* Add usage example in doc
test=develop
# This is the 5th commit message:
fix reference_count_pass,test=develop (#17060)
test=develop
# This is the 6th commit message:
Speedup roi_perspective_transform op by caching the information of linear interpolation in forward (#17090)
* Cache the information of linear interpolation in forward and use it in backward.
test=develop
* Fix cuda kernel.
test=develop
# This is the 7th commit message:
remove unnecessary prepare_data (#17080)
test=develop
# This is the 8th commit message:
fix interpolate cu. test=develop (#17101)
# This is the 9th commit message:
test=develop, double backward leaky_relu (#17067)
backward of backward: leaky_relu
# This is the 10th commit message:
fix fuse optimizer ops (#17102)
test=develop
# This is the 11th commit message:
truncated_gaussian_random supported in distributed training, test=develop (#17091)
# This is the 12th commit message:
Detailed coordinate description for yolov3 loss (#17007)
* Detailed coordinate description for yolov3 loss
test=develop
* modified api.spec
test=develop
* modified loss name
* fix api.spec
test=develop
* polish description
test=develop
* modified api.spec
test=develop
# This is the 13th commit message:
fix test_weight_decay (#17109)
test=develop
# This is the 14th commit message:
Path flag (#17105)
* fix detection problems in python/paddle/fluid/__init__.py
1. Use CudnnWorkspaceHandle in exhaustive search of conv_cudnn.
2. For Ops using CudnnWorkspaceHandle in exhaustive search, release their GPU memory after exhaustive search.
test=develop