* add bn and relu fuse pass
* add op attr assert and dtype assert
* fix some inputs&&outputs bugs for the fused op and pattern.
* add the unittest for fuse_bn_act_pass. test=develop
* use normative enforce statements. test=develop
* add the cpu test. test=develop
* add the support of batch_size=1 for the bn with relu op. test=develop
* add the error type for paddle throws. test=develop
* add fused_batch_norm_act and fused_batch_norm_act_grad to op_has_unsed_vars_white_list. test=develop
* Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc.
test=develop
* Call CUDA driver api to launch the kernel compiled by nvrtc.
test=develop
* Disable for mac and windows.
test=develop
* Refine the codes to support manually specified num_threads and workload_per_thread.
test=develop
* Refine the CUDA kernel to support large dims.
test=develop
* Add DeviceCodePool to manage all device codes.
* Add the first implementation fusion_group op.
* Add unit-test for fusion_group op.
* Add the check of result.
* Add the check of nvrtc in unit-test.
test=develop
* Add comment to explain the inputs, outputs and features of fusion_group op.
test=develop
* Disable fusion_group op for mac and windows.
test=develop
* Make the compiling of device code return status instead of hanging up.
test=develop
* Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API.
* Unify fusion_group_op's input and output names.
test=develop
* Add the check of CUDA driver library in unittest.
test=develop
* Refine the calling of PADDLE_ENFORCE.
test=develop
* test=develop, fix docker with paddle nccl problem
* don't expose numerous Tensor.set(), test=develop
* fix condition, test=develop
* fix float16 bug, test=develop
* feed should be Tensor or np.array, not Variable or number, test=develop
* use forcecast to copy numpy slice to new array, test=develop
* remove float16-uint16 hacking, test=develop
* add variable method to varbase and refactor to_variable to support return varbase
* support kwargs in varbase constructor
* add VarBase constructor to support default python args
* refine varbase initial method
* reset branch
* fix ut for change VarBase error info to PaddleEnforce
* cherry is parameter change before
* overload isinstance to replace too many change of is_variable
* rm useless files
* rm useless code merged by git
* test=develop, fix some ut failed error
* test=develop, fix test_graph_wrapper
* add some tests, test=develop
* refine __getitem__, test=develop
* add tests, test=develop
* fix err_msg, test=develop
* Implement Int8 FC
* Integrate FC into INT8v2
test=develop
* int8 FC: transpose weights before computing scales
test=develop
* Add support for activation_type string in FC
test=develop
* Disable MKL-DNN's FC in VGG16 and 19
test=develop
* Disable FC quantization when mkldnn FC is disabled
test=develop
* Solve PADDLE_ENFORCES in FC int8
* Fix Paddle enforces and remove const cast
test=develop
* Fix style changes
test=develop
* Fix quantizer_tester test and add fc quantization
test=develop
* Fix FC test fail on CUDA
* Remove unnecessary log from quantize placement pass
test=develop
* Add Thread ID to FC hash key
test=develop
* Add comments to MKL-DNN FC Kernel
test=develop
* Refactor quantizer
test=develop
* Fix linter issues
test=develop
* Fix crash in slim googlenet
test=develop
* Fix PADDLE_ENFORCE messages
test=develop
* Fix TensorRT detection bug
1. Add new search path for TensorRT at tensorrt.cmake
2. Add better debug message
3. Fix the bug of detection of TensorRT version
In NVIDIA official docker image, TensorRT headers are located at
`/usr/include/x86_64-linux-gnu` and TensorRT libraries are located
at `/usr/lib/x86_64-linux-gnu`, so using `-DTENSORRT_ROOT` will
fail to detect TensorRT.
There is no debug/warning message to tell developer that TensorRT
is failed to be detected.
In later version of TensorRT (e.g. v6), `NV_TENSORRT_MAJOR` is
defined at `NvInferVersion.h` instead of `NvInfer.h`, so add
compatibility fix.
* Fix TensorRT variables in CMake
1. Replace `${TENSORRT_ROOT}/include` with `${TENSORRT_INCLUDE_DIR}`
2. Replace `${TENSORRT_ROOT}/lib` with `${TENSORRT_LIBRARY}`
Manually type path may locate incorrect path of TensorRT. Use the
paths detected by system instead.
* Fix TensorRT library path
1. Add new variable - `${TENSORRT_LIBRARY_DIR}`
2. Fix TensorRT library path
inference_lib.cmake and setup.py.in need the path of TensorRT library
instead of the file of TensorRT library, so add new variable to fix it.
* Add more general search rule for TensoRT
Let system detect architecture instead of manually assign it, so
replace `x86_64-linux-gnu` with `${CMAKE_LIBRARY_ARCHITECTURE}`.
* Add more general search rule for TensorRT
Remove duplicate search rules for TensorRT libraries. Use
`${TENSORRT_LIBRARY_DIR}` to get full path of libnvinfer.so
test=develop
* Enrich the type of error and declare the error type interfaces, test=develop
* adjust tests to adapt new form, test=develop
* add inference deps with error_codes.pb.h, test=develop
* restore stack iter start pos, test=develop
* polish code based review comments, test=develop
* support no need buffer vars in dygraph, test=develop
* fix inference compilation error, test=develop
* update no_need_buffer_vars_inference, test=develop
* add unittests for no_need_buffer_vars_context, test=develop
* refine no_need_buffer_vars by return ref, test=develop
* polish some codes, test=develop
1.support asymmetric padding;
2.support padding algorithm:"SAME" and "VALID";
3.support channel_last: data_format NHWC and NDHWC;
4.change doc of python API and c++;
test=develop, test=document_preview
* Add fc_elementwise_layernorm_fuse pass and unittest.
* Add fused_fc_elementwise_layernorm op and its GPU kernel.
test=develop
* Apply fc_elementwise_layernorm_fuse_pass to GPU inference.
* Add the setting of attrs in the definition of binary_op.
test=develop
* Add comment.
* Implement the unittest.
test=develop
* Change the unittest name of layer_norm.
test=develop
TemporaryAllocator is a singleton used for allocating memory for Cudnn. Since it is a singleton, we can delete it for better performance in memory.
We replace TemporaryAllocator by CUDADeviceContextAllocator and CUDADeviceContextAllocation, which uses stream callback to delete the memory allocated for the stream to avoid singleton.
Also added data_feed_proto to operator to fix CI in CPU compilation
* Refine the codes related to fc op.
* Add GPU implementation for fc functor.
* Apply fc_fuse_pass in GPU inference.
test=develop
* Change the cmake for fc op.
* Change PADDLE_ENFORCE to PADDLE_ENFORCE_EQ.
* Add an attribute to set the activation type in fc_op.
* Enhance the unittest of fc_op.
test=develop
* Remove the declaration of FCOpGrad back to the header file.
test=develop
* Set default value for newly added arguments in test_fc_op.
test=develop
* Support looking up embeddings from BoxPS.
* Add a _pull_box_sparse op, for now this op is not exposed to users.
* Add a BoxHelper class, providing 'BeginPass', 'EndPass', 'FeedPass' functions and so on.
* Add 'BoxPSDataset' in python code.
* Add a compile options WITH_BOX_PS and a MACRO PADDLE_WITH_BOX_PS.
* Add UT.
* More concrete information pls refer to: https://github.com/PaddlePaddle/Paddle/pull/18982
* Implement the operator with sprase matrix multiply
* Update the URL of mklml library.
test=develop
* Disable MKLML implematation when using no-linux.
test=develop
* Ignore the deprecated status for windows
test=develop
* test=develop,fix the inference library compilation bug on windows
* test=develop,Fix the inference library compilation bug on windows
* test=develop,fix the bug that PYTHON_EXECUTABLE not exists
* update anakin-engine interfaces for content-dnn
test=develop
* support only-gpu mode of Anakin
modify eltwise parse
test=develop
* modification for thread-safe
test=develop
* Integrated template instance
test=develop
* increase template parameters
test=develop
* support MLU predictor
test=develop
* update anakin cmake files
test=develop
* update TargetWrapper::set_device
* update the initialization of anakin subgraph
test=develop
* use the default constructor of base class
test=develop
* load model from buffer with length
test=develop
* modify the access level of class
test=develop
* support anakin for bitmain arch
test=develop
* remove files
* checkout cmakelists
test=develop
* Optimize the concat and split kernel for special cases that the number of inputs/outputs is 2.
test=develop
* Refine codes.
test=develop
* Correct the condition.
test=develop
* Move the define of tmp_data outside the if statement.
* Print the cudnn minor version.
test=develop
* Fix the case when in_num/o_num is 1 in concat/split op.
test=develop
* Remove const_cast.
test=develop
* fuse mul and elementwise add to fc
* Reimplement the FC forward operator
* Fix FC MKLDNN integration by transposing weights
* Add FC MKLDNN Pass
test=develop
* FC MKLDNN Pass: change memcpy to std::copy
* Fix MKLDNN FC handling of mismatch input and weights dims
* Lower tolerance for MKL-DNN in resnet50 test
test=develop
* Adjust FC to support MKLDNN Op placement
test=develop
* Adjust Placement Op to set use_mkldnn attribute for graph
test=develop
* MKLDNN FC: fix weights format so that gemm version is called
test=develop
* FC MKLDNN: Remove tolerance decrease from tester_helper
* FC MKL-DNN: Refactor the code, change input reorder to weight reorder
* MKL-DNN FC: Introduce operator caching
test=develop
* FC MKL-DNN: Fix the tensor type in ExpectedKernelType
test=develop
* FC MKL-DNN: fix style changes
test=develop
* FC MKL-DNN: fallback to native on non-supported dim sizes
test=develop
* FC MKLDNN: fix CMake paths
test=develop
* FC MKLDNN: Refine placement pass graph mkldnn attribute
test=develop
* Fix Transpiler error for fuse_conv_eltwise
test=develop
* Fix missing STL includes in files
test=develop
* FC MKL-DNN: Enable new output size computation
Also, refine pass to comply with newest interface.
test=develop
* FC MKL-DNN: enable only when fc_mkldnn_pass is enabled
* FC MKL-DNN: Allow Weights to use oi or io format
* FC MKL-DNN: Adjust UT to work with correct dims
test=develop
* Enable MKL DEBUG for resnet50 analyzer
test=develop
* FC MKL-DNN: Improve Hashing function
test=develop
* FC MKL-DNN: Fix shape for fc weights in transpiler
* FC MKL-DNN: Update input pointer in re-used fc primitive
* Add log for not handling fc fuse for unsupported dims
test=develop
* FC MKL-DNN: Move transpose from pass to Op Kernel
test=develop
* FC MKL-DNN: Disable transpose in unit test
test=develop
* FC MKL-DNN: Remove fc_mkldnn_pass from default list
* Correct Flag for fake data analyzer tests
test=develop
* FC MKL-DNN: Add comment about fc mkldnn pass disablement
test=develop
* FC MKL-DNN: Disable fc in int8 tests
test=develop
* link the libwbaes.so into paddle
* polish detail, test=develop
* try fix mac_pr_ci error, test=develop
* add compile option, test=develop
* fix ci error, test=develop
* ignore failed to find mac lib, test=develop
* change cdn to bj, cdn can't get the latest version
* trigger ci, test=develop
* temporary delete win32 lib linking, test=develop
* change https to http, test=develop
* turn compile option on to off
* turn compile option off to on, test=develop
* try lib compiled by gcc4.8, test=develop
* update lib version, test=develop
* link other lib, test=develop
* add setup config
* delete false, test=develop
* delete no_soname, test=develop
* recover so name set
* fix, test=develop
* adjust make config, test=develop
* remove link to wbaes, test=develop
* remove useless define, test=develop
* Support Sync Batch Norm.
* Note, do not enable it in one device.
Usage:
build_strategy = fluid.BuildStrategy()
build_strategy.sync_batch_norm = True
binary = fluid.compiler.CompiledProgram(tp).with_data_parallel(
loss_name=loss_mean.name,
build_strategy=build_strategy)
* Upgrade MKLDNN to v0.18-rc and fix issue caused by lib/lib64
Upgrade MKLDNN to v0.18-rc
Also fix the issue during upgrade
test=develop
* Rebase MKLDNN to rls-v0.18 branch
Some issues in v0.18-rc which caused INT8 conv op unit test failure was fixed
in rls-v0.18 branch
test=develop
* Upgrade MKLDNN from v0.18rc to formal v0.18 tag
test=develop
* Fix the windows compile issue.
test=develop
* Optimize for gelu operator
* Set up the low accuracy mode of MKL ERF function.
test=develop
* Only enable MKLML ERF when OS is linux
* Use the speical mklml version included vmsErf function to verify gelu mkl kernel.
test=develop
* Add the CUDA macro to avoid NVCC's compile issue.
test=develop
* Add the TODO comments for mklml library modification.
test=develop
* Clean Code
test=develop
* Add the comment of marco for NVCC compiler.
test=develop
* Simplify the compare op for CPU.
* Use asynchronous tensor copy in reshape_op's kernel.
* Optimize while_op for test, avoiding creating variables every time.
test=develop
* Enable the cache of kernel type and kernel function.
test=develop
* Enable profiling with gperftools.
* Remove flags for testing, and fix the linking error.
test=develop
* Delete the codes of ChooseKernel.
test=develop
* Fix bug when preparing ExecutorPrepareContext for while_op.
* Fix missing depending on grpc libraries.
* Remove the redundant print.
test=develop
* Follow comments.
* Remove the codes related to prepare the ExecutorPrepareContext for while_op.
test=develop