* make OpTest check grad inplace even if forward has no inplace, test=develop
* do not run PE when enable_inplace is False, test=develop
* add conv3d cuda kernel for float16 type, test=develop
* refactor OpTest for inplace, test=develop
* add comments, test=develop
* add pybind interface to get all inplace ops, test=develop
* enhance OpTest to check whether the consistency of operator when using and not using inplace, test=develop
* handle corner cases in op_test, test=develop
* support outputs without tensor holder_, like XShape in reshape_op, test=develop
* fix bug, some op has GradOpMaker, but actually no grad_op in OpInfoMap, test=develop
* use reshape_grad instead of reshape in FlattenGradOp, test=develop
* fix error debug dims info for variables like XShape, test=develop
* change computational order in sum_op to relieve computation difference using inplace, test=develop
* add inplace_atol to check group_norm, and skip inplace_grad for mkldnn, test=develop
* follow sneaxiy's comments, test=develop
* remove unused DefaultGradOpDescMaker in mkldnn op, test=develop
* add fp16 backward support
test=develop
* add sum_op fp16 test
* disable test_dist_save_load
test=develop
* add check_grad for sum
* add unit test for softmax_grad fp16
test=develop
* add scale_op unit test
* add mul_grad_op unit test for fp16
* add cross_entropy_grad and eman_grad unit test for fp16
test=develop
* fix cross_entropy unit test
* add pool2d fp16 unit test
* refine conv2d fp16 unit test
test=develop
* refine activation unit test
test=develop
* fix ci
test=develop
* follow zhihong's comment, copy from https://github.com/PaddlePaddle/Paddle/pull/12796
test=develop
* Add intermediate
* fix flatten/squeeze/unsqueeze
* Considering compatibility issues, we could not fix the origin op
* follow comment
* reset the shape of XShape
* Enhance the function of fused_elementwise_activation_op
* enhance unit test
* Clean Code And Add Doc
* Add compound functors
* Fix doc and enhance unit test
* define Dx and Dy for d_binary_func
* add mul_scale
* add mul_scale
* add elementwise_mul
* code refine
* code refine
* add doc
* add AsIntermediate
* add lod_tensor util and modify pybind
* refind pybind LoDTensor API and modify LoDTensor and DataFeeder test
* fix test error
* fix detection map op test
* fix reorder_lod_tensor test
* fix seq_concat_op
* fix chunk evel op test
* fix target assign op
* fix warp ctc op
* address comments step 1: reverse reset_lod op
* step 2: modify op test
* add warning message
* remove has_valid_lod
* add back has_valid_lod
* address comments
* add exception catching trial