- Refactor step 1
- Compilation fix
- Yet another compilation fix
- Even more compilation fix
- Lint fixes
test=develop
- Removed deprecated PADDLE_ENFORCE occurrence
test=develop
- Candidate fix to BN forward
- Lint fixes
test=develop
- Refactoring in data_layout_transform
- compilation fix
- Another compilation fix
- Step further into darkness
- Yet another compilation fix
- Yet another compilation fix
- missing header
- compilation fix
- Added MKLDNN -> Paddle conversion in fetch op
test=develop
- Compilation fix
test=develop
- Lint
test=develop
- Mul fix
- Fix to MKLDNN MUL op and Elementwise MUL UT
test=develop
- Workaround for the different representation of weights with groups in
Paddle vs MKL-DNN.
test=develop
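A minimal sketch of the representation difference, assuming Paddle stores grouped conv weights as (O, I/g, H, W) while MKL-DNN expects an explicit leading group dimension (g, O/g, I/g, H, W). The helper name is hypothetical and only the shape is computed; under that assumption the buffer contents stay in the same order.

```python
def paddle_to_mkldnn_group_shape(weight_shape, groups):
    """Return the weight shape with an explicit leading group dimension.

    Assumption: weight_shape is (O, I/g, *spatial) and the grouped
    layout is (g, O/g, I/g, *spatial). Hypothetical helper, not Paddle API.
    """
    out_channels, in_per_group, *spatial = weight_shape
    if out_channels % groups != 0:
        raise ValueError("output channels must be divisible by groups")
    if groups == 1:
        return list(weight_shape)  # no group dimension needed
    return [groups, out_channels // groups, in_per_group, *spatial]


conv2d_shape = paddle_to_mkldnn_group_shape([8, 2, 3, 3], groups=4)
conv3d_shape = paddle_to_mkldnn_group_shape([6, 1, 3, 3, 3], groups=2)
print(conv2d_shape, conv3d_shape)
```

The conv3d case yields a 6D shape, which is the situation the 5D convolution fixes that follow deal with.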
- Candidate fix for 5D convolution with groups
- Refactor of fix for conv3d and conv2d in fetch op
test=develop
- Compilation fix
- Still same compilation fix
- Compilation fix
- Compilation fix
- Reverted refactoring of fixes
- Adapted test_conv2d_int8_mkldnn so it expects data in NCHW format,
not NHWC
test=develop
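For reference, the difference between the two layouts can be sketched as an offset remapping on a flat buffer. This is an illustration only; the function name is hypothetical and it is not the code used by the test.

```python
def nhwc_to_nchw(data, n, h, w, c):
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    out = [None] * (n * c * h * w)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    dst = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    out[dst] = data[src]
    return out


# One image of height 1 and width 2 with 3 channels per pixel.
nhwc = ["r0", "g0", "b0", "r1", "g1", "b1"]
nchw = nhwc_to_nchw(nhwc, n=1, h=1, w=2, c=3)
print(nchw)  # channels become contiguous planes
```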
- minor fix in UT
test=develop
- Lint fixes
test=develop
* supports multiple NCCL communicators preserved in NCCLCommContext
test=develop
* add ut for c_comm_init_all operator and fix cuda resource release problem
test=develop
* replace part of PADDLE_ASSERT to PADDLE_ENFORCE
test=develop
* remove unused fallback_alloc_size_
* add unit-test of CUDAPinnedAllocator
test=develop
* Implement the operator with sparse matrix multiply
* Update the URL of mklml library.
test=develop
* Disable the MKLML implementation on non-Linux platforms.
test=develop
* Ignore the deprecated status for windows
test=develop
* fix warpctc.dll not found issue, test=develop
* revert the linux platform change, test=develop
* delete warpctc_lib_path.h.in, test=develop
* add SetPySitePackagePath function
* fix warpctc.dylib not found issue on Mac, test=develop
* improve the paddle lib path setting logic, test=develop
* fix mac ci issue caused by test_warpctc_op unittest, test=develop
* tweak code, test=develop
test=develop
- Extracted key generation from FWD and GRAD into separate function
test=develop
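The idea of sharing one key-generation routine between forward and grad can be sketched as follows. All names and the key format here are hypothetical; the point is only that both sides call the same helper, so a cached entry stored by the forward kernel is found by the grad kernel.

```python
def make_key(op_name, dims, dtype, suffix=""):
    """Build a cache key from the attributes that identify a primitive.

    Hypothetical format: name, dims joined by 'x', dtype, optional suffix.
    """
    dims_part = "x".join(str(d) for d in dims)
    parts = (op_name, dims_part, dtype, suffix)
    return "-".join(part for part in parts if part)


def forward_key(dims, dtype):
    # The forward kernel stores its primitive under this key.
    return make_key("conv", dims, dtype)


def grad_key(dims, dtype):
    # The grad kernel rebuilds the same key through the shared helper.
    return make_key("conv", dims, dtype)


cache = {forward_key([1, 3, 224, 224], "fp32"): "forward primitive"}
found = cache.get(grad_key([1, 3, 224, 224], "fp32"))
print(found)
```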
- Compilation fix
test=develop
- another compilation fix
test=develop
* change INT8 to template so that checking dst_dt with if-else could be removed. CI will be enabled after fixing reviews
* reverse user_residual_memory_p and user_bias_memory_p declaration scope
test=develop
optimize the error reporting information of cuda related API
index on develop: 130ac17 Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop
* feature/auto_growth_allocator, test=develop
* add unittest of AlignedAllocator, test=develop
* try to turn on auto_growth to test on CI, test=develop
* fix segmentation fault in mixed_vector.h, test=develop
* add unittests, test=develop
* rename mkldnn set/get_cur_thread_id() to set/get_cur_mkldnn_session_id()
test=develop
* update session id definition and adjust logic for default behavior
test=develop
* reset logic in mkldnn reuse, as most cases work by default.
test=develop
1. Since the allreduce op has 4 reduce types, we split these four reduce types into four ops
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter, as the target DeviceContext is already known
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis
* Fix bug in quantize kernel which causes a crash in vgg16/19 models
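A rough sketch of the split described in item 1, assuming the four reduce types are sum, prod, max, and min (the reductions NCCL offers). The function names are illustrative, and ranks are simulated with plain lists rather than real communicators.

```python
from functools import reduce
import operator


def _allreduce(rank_values, combine):
    """Combine element i across all ranks; every rank receives the result."""
    reduced = [reduce(combine, column) for column in zip(*rank_values)]
    return [list(reduced) for _ in rank_values]


# One op per reduce type instead of a single op with a reduce-type attribute.
def allreduce_sum(rank_values):
    return _allreduce(rank_values, operator.add)


def allreduce_prod(rank_values):
    return _allreduce(rank_values, operator.mul)


def allreduce_max(rank_values):
    return _allreduce(rank_values, max)


def allreduce_min(rank_values):
    return _allreduce(rank_values, min)


ranks = [[1, 5], [2, 3], [4, 0]]  # three ranks, two elements each
print(allreduce_sum(ranks)[0], allreduce_max(ranks)[0])
```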
test=develop
* refine the code to reduce verbose code; test=develop
* remove useless code; test=develop
1. some key generation methods are not aligned with PR#17965
2. extend ptr lifetime to avoid memory release if SetBlob fails;
otherwise it will core dump.
test=develop
* fix prepare context redundant code problem, optimize executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unittest for use_program_cache
test=develop
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
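Why the loss gradient is scaled in collective SGD can be sketched as follows, assuming the scale factor is 1/nranks (the list above does not state the factor). With that scale, a sum allreduce over per-rank gradients gives the mean gradient. Ranks are simulated in-process and all names are illustrative.

```python
def local_grad(w, x, y, loss_scale):
    """Gradient of loss_scale * (w * x - y) ** 2 with respect to w."""
    return loss_scale * 2.0 * (w * x - y) * x


def allreduce_sum(values):
    """Simulated sum allreduce: every rank receives the total."""
    total = sum(values)
    return [total for _ in values]


w = 0.5
shards = [(1.0, 2.0), (2.0, 1.0), (3.0, 0.0), (4.0, 4.0)]  # one (x, y) per rank
nranks = len(shards)

# Each rank scales its loss gradient by 1/nranks before the allreduce.
scaled = allreduce_sum([local_grad(w, x, y, 1.0 / nranks) for x, y in shards])

# Reference: the mean of the unscaled per-rank gradients.
mean_grad = sum(local_grad(w, x, y, 1.0) for x, y in shards) / nranks
print(scaled[0], mean_grad)
```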
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update support python3.5
* test=develop change gpu memory use to 0.1 when test
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dump on py35
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream
* test=develop update unittest sync operator I/O
Add Pipeline Concurrency Train Mode:
- Cpp: pipeline_trainer & section_worker
- Python: PipelineOptimizer
- Add a new data_feed type: PrivateInstantDataFeed
- Add a test demo of pipeline trainer and the test model is gnn
- win32 is not supported for now
* Relu6 is the bottleneck op for Mobilenet-v2. As mkldnn supports the conv/relu6 fusion, we implement this fusion via the cpass way. Since int8 enabling for this fusion will be supported in MKLDNN v0.20, this PR focuses on the fp32 optimization.
The table below shows the benchmark (FPS) measured on skx-8180 (28 cores)
Batch size | with fusion | without fusion
-- | -- | --
1 | 214.7 | 53.4
50 | 1219.727 | 137.280
test=develop
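The speedup implied by the benchmark table above can be worked out directly. This is plain arithmetic on the reported FPS values, nothing more.

```python
benchmark_fps = {
    # batch size: (with fusion, without fusion), FPS values from the table
    1: (214.7, 53.4),
    50: (1219.727, 137.280),
}

speedup = {
    batch_size: fused / unfused
    for batch_size, (fused, unfused) in benchmark_fps.items()
}

for batch_size, ratio in speedup.items():
    print(f"batch size {batch_size}: {ratio:.2f}x faster with fusion")
```

That is roughly 4x at batch size 1 and close to 9x at batch size 50.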
* Fix the format issue
test=develop
* Add the missing nolint comments.
test=develop
* Fix the typos.
test=develop
* Register the conv_brelu_mkldnn_fuse_pass for the MKLDNN engine.
test=develop
* Adjust the indentation.
test=develop
* Add the test_conv_brelu_mkldnn_fuse_pass case.
test=develop
* Slightly update the code per Baidu comments.
Embed the parameter definition into the code.
That will make the code easier to understand.
test=develop
* Add conv2d_grad_grad_op
* Extract the cuDNN conv algo searching code into conv_cudnn_helper.h.
- Now use it in conv2d_grad_grad.
- Will simplify the searching code in conv2d and conv2d_grad in next PR.
* Enhance and fix bug in unit testing of gradient_checker.
* Support fetching empty variables; return None in Python.
* Refine elementwise kernel.
Add a simple cuda kernel if grad x and y both exist
Use 2D block cuda kernel to do broadcast.
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
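What the broadcast gradient has to compute can be sketched in plain Python, using addition as the example op (the commit does not say which elementwise ops are covered). When y is broadcast over the rows of x, the gradient of y is the column-wise sum of the output gradient. This is a reference computation, not the CUDA kernel.

```python
def elementwise_add_grad(dout, rows, cols):
    """Gradients of out = x + y for x of shape (rows, cols) and y of
    shape (cols,), with y broadcast over rows. dout is row-major."""
    dx = list(dout)  # x has the output's shape, so its gradient passes through
    dy = [0.0] * cols
    for r in range(rows):
        for c in range(cols):
            dy[c] += dout[r * cols + c]  # reduce over the broadcast dimension
    return dx, dy


dout = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dx, dy = elementwise_add_grad(dout, rows=2, cols=3)
print(dx, dy)
```

In a 2D block layout, one natural mapping is one thread column per output column, with the rows reduced cooperatively.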
* refine code.
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
* refine code.
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
1. Use CudnnWorkspaceHandle in exhaustive search of conv_cudnn.
2. For Ops using CudnnWorkspaceHandle in exhaustive search, release their GPU memory after exhaustive search.
test=develop
* speedup gc and inplace softmax_with_cross_entropy_grad
test=develop
* refine models gpu mem
Merge skip vars and warning messages of mem opt
remove relu mem opt
test=develop
* follow comments
test=develop
* - Reuse of conv PD
- conv transpose pd reused
- Added PD reusing of softmax and Batch Norm
- Refactoring and removal of not needed routines of mkl-dnn ops
test=develop
- Fix to reusing conv
test=develop
- Lint fixes
test=develop
- Further lint fixes
test=develop
- Lint fixes
test=develop
- lint fixes
test=develop
- Lint workaround
test=develop
* - Fix after review on including boost as third party header
test=develop
* - Fix after review. Name change to something more descriptive
test=develop
* link the libwbaes.so into paddle
* polish detail, test=develop
* try fix mac_pr_ci error, test=develop
* add compile option, test=develop
* fix ci error, test=develop
* ignore failed to find mac lib, test=develop
* change cdn to bj, cdn can't get the latest version
* trigger ci, test=develop
* temporary delete win32 lib linking, test=develop
* change https to http, test=develop
* turn compile option on to off
* turn compile option off to on, test=develop
* try lib compiled by gcc4.8, test=develop
* update lib version, test=develop
* link other lib, test=develop
* add setup config
* delete false, test=develop
* delete no_soname, test=develop
* recover so name set
* fix, test=develop
* adjust make config, test=develop
* remove link to wbaes, test=develop
* remove useless define, test=develop
* Revert "[MKL-DNN] Fix to crash of Transformer when mkldnn is to be used (#16233)"
This reverts commit 13816dd4ac.
Apart from enabling transformer for MKL-DNN
* Revert "- MKL-DNN pooling updated to set_prim_desc"
This reverts commit c63f6b2039.
Conflicts:
paddle/fluid/operators/mkldnn/concat_mkldnn_op.cc
* Revert "[MKL-DNN] MKL-DNN specific Tensor modification (#15429)"
test=develop
This reverts commit dec9cf53c8.
* - concat compilation fix
- lint
test=develop
- Lint fixes
test=develop
- Lint fixes
test=develop
- Fix Transpose MKLDNN op
test=develop
* Support Sync Batch Norm.
* Note: do not enable it on a single device.
Usage:
    build_strategy = fluid.BuildStrategy()
    build_strategy.sync_batch_norm = True
    binary = fluid.compiler.CompiledProgram(tp).with_data_parallel(
        loss_name=loss_mean.name,
        build_strategy=build_strategy)
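The general idea behind synchronized batch norm statistics can be sketched as follows: each device contributes its count, sum, and sum of squares, and the combined totals give the mean and variance of the whole global batch. This is a conceptual sketch with hypothetical helper names, not Paddle's kernel.

```python
def local_stats(values):
    """Per-device partial statistics: count, sum, and sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)


def sync_mean_var(per_device):
    """Combine partial statistics as an allreduce over devices would."""
    count = sum(n for n, _, _ in per_device)
    total = sum(s for _, s, _ in per_device)
    total_sq = sum(q for _, _, q in per_device)
    mean = total / count
    var = total_sq / count - mean * mean  # biased variance, E[x^2] - E[x]^2
    return mean, var


device_batches = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, var = sync_mean_var([local_stats(batch) for batch in device_batches])
print(mean, var)
```

With a single device there is nothing to combine across, so the result would match ordinary batch norm.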
* Optimize key creation of INT8 pool kernel to improve the performance of ResNet-50 and MobileNet, especially for latency.
test=develop
* Optimize key creation of pool fp32 grad.
test=develop
* - Implemented draft of primitive desc keeping in Tensor
test=develop
- TransposeMKLDNNHandler::AcquireSrcMemory was reimplemented
- Added nchw and nc formats setting for sake of compatibility
Fixed unit tests
- Workaround to problem with 5D data in conv
- Added 3D and 1D MKL-DNN formats for name handles for tensor
test=develop
- Fix to UTs
test=develop
- Conv fp32 op was updated
Cosmetic fixes
test=develop
- tensor mkldnn cosmetics
test=develop
- Moved most of mkl-dnn specific code from Tensor to mkl-dnn utils
* - Lint fixes
test=develop
* - setting prim desc in Tensor also sets layout to kMKLDNN
test=develop
* - Moved creation of prim desc totally out of Tensor
test=develop
* - Cosmetic fixes after review
test=develop
* Optimize for gelu operator
* Set up the low accuracy mode of MKL ERF function.
test=develop
* Only enable MKLML ERF when OS is linux
* Use the special mklml version that includes the vmsErf function to verify gelu mkl kernel.
test=develop
* Add the CUDA macro to avoid NVCC's compile issue.
test=develop
* Add the TODO comments for mklml library modification.
test=develop
* Clean Code
test=develop
* Add the comment of macro for NVCC compiler.
test=develop
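For context on why the ERF function matters here: the exact gelu is defined through erf, gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))). A scalar reference using the standard library looks like this; it is not the vectorized MKL path.

```python
import math


def gelu(x):
    """Exact gelu: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


samples = [-1.0, 0.0, 1.0]
outputs = [gelu(x) for x in samples]
print(outputs)
```

A reference like this can be handy for checking an optimized kernel within a tolerance, which matters when a low accuracy erf mode is used.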
* Enable momentum operator for a ngraph engine
test=develop
* Update tests
test=develop
* Removed an unnecessary line of code, as intended
test=develop
* Refine the beam_search op and test.
* A basic CUDA implementation of beam_search for small batch_size.
* Implement CUDA kernel for beam_search_op.
* Use multiple CUDA threads in the same block to select the top beam.
* Update the python api of beam_search op.
* Enable extend function in CPU kernel of beam_search op.
* Unify the CUDA codes.
test=develop
* Unify the CPU kernel of beam_search op.
* Ensure the selected items of beam_search_op's CPU kernel are sorted by scores.
* Update the description of beam_search in API.spec.
* Enable the use of CUDA kernel in beam_search op.
* Exclude the beam_search's CUDA unittest when there is no CUDA gpu, and delete some debugging statements.
test=develop
* Follow comments.
test=develop
* Call the CPU kernel for beam_search op when batch_size > 4.
test=develop
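The selection step that both kernels perform can be sketched for a single source sentence: from all candidate extensions, keep the beam_size highest-scoring ones, sorted by score. The candidate tuple layout (prefix index, token id, accumulated score) is an assumption made for this illustration.

```python
import heapq


def select_top_beam(candidates, beam_size):
    """Keep the beam_size best candidates, highest score first.

    Each candidate is (prefix_index, token_id, accumulated_score).
    """
    return heapq.nlargest(beam_size, candidates, key=lambda item: item[2])


candidates = [
    (0, 7, -1.2),
    (0, 3, -0.4),
    (1, 5, -0.9),
    (1, 2, -2.0),
]
selected = select_top_beam(candidates, beam_size=2)
print(selected)
```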
* Remove the exception for is_empty op in PrepareData.
test=develop