* sequential reader stage 1, test=develop
* fix ut, test=develop
* fix iterable=False reset bug, add some logs and polish code, test=develop
* inference feed partial data, test=develop
* Turn on keep_order=True for test, test=develop
* enhance ut to test more cases, test=develop
* test commit for reverting
* Revert "test commit for reverting", test=develop
This reverts commit 80aef42ef52ba1ee79627d6f663a624ec4f12f58.
* add ut of merged and unmerged results, test=develop
* add more uts for coverages and add en doc of api, test=develop
* follow comments, test=develop
* change note style, test=develop
* change the ci trt from version 5. to 6.0
* paddle-trt dynamic shape support init
* conv+bias or conv+bn dynamic shape support
test=develop
* modify trt engine opconvert
test=develop
* fix ci error
test=develop
* Refine adam op, test=develop
* Fuse kernels together to reduce cpu time.
* Refine paddle enforce, test=develop
* Remove some comments, test=develop
* Refine code, test=develop
* Refine cuda kernel, test=develop
* Refine code according to comments, test=develop
* Correct CPU gradients of the argsort op, form a network to test its forward and backward process, test=develop
* fix dynamic threshold error in test_argsort_op, test=develop
* add partial_concat, test=develop
* fix the grids and blocks, test=develop
* fix the Paddle_Enforce, test=develop
* fix the doc of op, test=develop
* fix the doc, test=develop
* fix the doc of the op, test=develop
* replace -1 with None, test=develop
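
For readers unfamiliar with partial_concat, here is a minimal NumPy sketch of the behavior as I understand it: take the same column range from each 2-D input and concatenate the slices along the second dimension. The function name and the `start_index` / `length` parameters are assumptions chosen for illustration. This is not the operator's actual implementation.

```python
import numpy as np

def partial_concat(inputs, start_index=0, length=-1):
    """Concatenate columns [start_index, start_index + length) of each 2-D input.

    Hypothetical helper: a negative length means "up to the last column".
    """
    parts = []
    for x in inputs:
        x = np.asarray(x)
        if x.ndim != 2:
            raise ValueError("only 2-D inputs are handled in this sketch")
        cols = x.shape[1]
        # Allow a negative start index counted from the end of the row
        start = start_index if start_index >= 0 else start_index + cols
        stop = cols if length < 0 else start + length
        parts.append(x[:, start:stop])
    return np.concatenate(parts, axis=1)

x = np.array([[0, 1, 2], [3, 4, 5]])
y = np.array([[6, 7, 8], [9, 10, 11]])
out = partial_concat([x, y], start_index=1, length=2)
print(out)
```

Here columns 1 and 2 of `x` are followed by columns 1 and 2 of `y`.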
* improve the mul_mkldnn_op line coverage
test=develop
* remove fp32 mul mkldnn kernel
test=develop
* locally refactoring
test=develop
* change according to reviews
test=develop
* Add TopK Op Grad CPU&GPU Kernel test=develop
* Add TopK Op Grad, modify grad op maker test=develop
* Add TopK Op Grad, modify PADDLE_ENFORCE test=develop
* Add TopK Op Grad, modify PADDLE_THROW test=develop
* Add TopK Op Grad, modify unittest test=develop
* fix ngraph top k op unittest test=develop
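
The gradient of a top-k op can be sketched as a scatter: each upstream gradient value goes back to the input position its output came from, and every other position gets zero. The NumPy sketch below illustrates that idea only. The helper names are hypothetical, and it does not reproduce the CPU/GPU kernels added here.

```python
import numpy as np

def topk_forward(x, k):
    """Return the k largest values along the last axis and their indices."""
    idx = np.argsort(-x, axis=-1, kind="stable")[..., :k]
    vals = np.take_along_axis(x, idx, axis=-1)
    return vals, idx

def topk_backward(x_shape, idx, dout):
    """Scatter dout back to the selected positions; all others stay zero."""
    dx = np.zeros(x_shape, dtype=dout.dtype)
    np.put_along_axis(dx, idx, dout, axis=-1)
    return dx

x = np.array([[1.0, 5.0, 3.0], [4.0, 2.0, 6.0]])
vals, idx = topk_forward(x, k=2)
dout = np.array([[10.0, 20.0], [30.0, 40.0]])
dx = topk_backward(x.shape, idx, dout)
print(dx)
```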
* update ops's unittest of elementwise_pow, elementwise_max, elementwise_min, scale and sqrt
1. update the unittests of elementwise_pow, elementwise_max and scale to use float64 input data (float32 -> float64)
2. fix bug that elementwise_pow doesn't meet the threshold requirements when handling float64 data
3. remove sqrt from op_accuracy_white_list.py
4. update the unittests of elementwise_pow, elementwise_max and elementwise_min ops so that their input data shape is over 100
5. test=develop
* modify the writing style according to suggestions
test=develop
* Add support for dynamic_decode(while) training. test=develop
* Fix assign_op and tensor_array_read_write_op after solving conflict. test=develop
* Fix test_rnn_decode_api.py. test=develop
* Refine docs for apis in rnn.py. test=develop
* Adjust outputs of dynamic_decode. test=develop
* Remove the force_cpu update in assign_op. test=develop
* Make RNNCell.get_initial_states support batch_dim_idx argument. test=develop
* Rename _create_array_outof_while as _create_array_out_of_while in rnn.py.
test=develop
* support slice double grad, test=develop
* merge two doublegradopmaker to one doublegradopmaker, test=develop
* change the shape of slice_OP's unittest, test=develop
Refine PaddleBox Framework, Main functions:
* Add MetricMsg util class, which can calculate metrics like AUC, bucket_error, COPC.
* Replace FeedPass with new interface: BeginFeedPass & EndFeedPass
* Refactor Pull/Push Sparse Function in box_wrapper.
* Use CUDA Kernel to copy keys and copy feasign between tensor and boxps struct.
* Cache copied keys in pull sparse in order to reuse it in push period.
* Add log in memory::Copy for debug purpose.
* Change to use the context in DeviceContextPool directly in sequence_pooling_test, instead of creating a new one.
* Change to use the context in DeviceContextPool directly in sequence_padding_test, instead of creating a new one.
test=develop
* Change the type of second_dim from size_t to int64_t.
test=develop
* Enable quantize to reorder to nchw as well
* Correct FC MKL-DNN input dim requirements to accept 3D
* Improve DNNL FC format, error and 3D input handling
test=develop
* Improve error checking in FC
test=develop
* Improve PADDLE_ENFORCE messages in fc-related files
* Remove data layout attribute from obligatory pass args
test=develop
* Fix message in fc_mkldnn_pass to be logically correct
test=develop
* add bn and relu fuse pass
* add op attr assert and dtype assert
* fix some input and output bugs for the fused op and pattern.
* add the unittest for fuse_bn_act_pass. test=develop
* use normative enforce statements. test=develop
* add the cpu test. test=develop
* add the support of batch_size=1 for the bn with relu op. test=develop
* add the error type for paddle throws. test=develop
* add fused_batch_norm_act and fused_batch_norm_act_grad to op_has_unsed_vars_white_list. test=develop
1. Add a new input named batch_roi_nums for prroi_pool_op. batch_roi_nums holds the number of RoIs for each image in the batch when rois is a Tensor. When rois is a LoDTensor, this information is stored in its LoD.
2. add grad check to prroi_pool_op and fix the abnormal X grad diff on CPU.
* support elu activation double grad, test=develop
* delete the code commit in .cc, test=develop
* fix failing relu test, test=develop
* add elu double grad kernel and unit test
* add calculation of dX in elu double grad functor, test=develop
* update the commit code, test=develop
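
As a reference for what the elu double grad has to compute, here is a NumPy sketch derived from the ELU definition out = x for x > 0 and alpha * (exp(x) - 1) otherwise. The function and variable names (`ddx`, `ddout`) are my own. The formulas follow from differentiating the first-order backward, and are not copied from the kernel.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, dout, alpha=1.0):
    """First-order backward: dx = dout * f'(x)."""
    return dout * np.where(x > 0, 1.0, alpha * np.exp(x))

def elu_double_grad(x, dout, ddx, alpha=1.0):
    """Backward of elu_grad, given ddx (the gradient flowing into dx).

    ddout = ddx * f'(x)           gradient w.r.t. dout
    dx    = ddx * dout * f''(x)   gradient w.r.t. x; f''(x) is 0 for x > 0
    """
    ddout = ddx * np.where(x > 0, 1.0, alpha * np.exp(x))
    dx = ddx * dout * np.where(x > 0, 0.0, alpha * np.exp(x))
    return dx, ddout

x = np.array([-1.5, -0.2, 0.7])
dout = np.array([0.3, -2.0, 1.1])
ddx = np.array([1.0, 0.5, 2.0])
dx, ddout = elu_double_grad(x, dout, ddx, alpha=0.8)
```

A finite-difference check of `elu_grad` with respect to `x` is a cheap way to validate the `dx` term.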
* Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc.
test=develop
* Call CUDA driver api to launch the kernel compiled by nvrtc.
test=develop
* Disable for mac and windows.
test=develop
* Refine the codes to support manually specified num_threads and workload_per_thread.
test=develop
* Refine the CUDA kernel to support large dims.
test=develop
* Add DeviceCodePool to manage all device codes.
* Add the first implementation fusion_group op.
* Add unit-test for fusion_group op.
* Add the check of result.
* Add the check of nvrtc in unit-test.
test=develop
* Add comment to explain the inputs, outputs and features of fusion_group op.
test=develop
* Disable fusion_group op for mac and windows.
test=develop
* Make the compiling of device code return status instead of hanging up.
test=develop
* Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API.
* Unify fusion_group_op's input and output names.
test=develop
* Add the check of CUDA driver library in unittest.
test=develop
* Refine the calling of PADDLE_ENFORCE.
test=develop
* optimize adam speed by removing _finish_update test=develop
* fix SparseAdamFunctor param list test=develop
* Remove scale_op in expect_list of adam_op test=develop
* fix test optimizer loss assert error test=develop
* modify PADDLE_ENFORCE usage test=develop
* fix op_type in lamb_op.cc test=develop
* fix errors ostream format bug test=develop
* add betaPowOut in ngraph op test=develop
* fix ngraph::op api for gcc8 test=develop
* clean code test=develop
* modify struct into class test=develop
* remove code of beta1Tensor in lamb_op test=develop
* fix elementwise_pow bug on integer, test=develop
* use llrint to support elementwise_pow_grad, test=develop
* add some tests, test=develop
* revert grad functor, test=develop
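
The messages above don't describe the integer bug in detail. One common failure mode when an integer power is computed in floating point is truncation of a result that lands just below the exact integer, and rounding to the nearest integer (as C's llrint does) avoids it. The sketch below illustrates only that general idea with NumPy. `int_pow` is a hypothetical helper, not the operator's code.

```python
import numpy as np

def int_pow(x, y):
    """Integer power computed in float64, then rounded to the nearest integer."""
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    result = np.power(x.astype(np.float64), y.astype(np.float64))
    # np.rint rounds to the nearest integer (ties to even), like llrint
    # under the default rounding mode
    return np.rint(result).astype(np.int64)

# Truncation versus rounding on a value just below an integer
almost_25 = 24.999999999
truncated = int(almost_25)          # drops the fractional part
rounded = int(np.rint(almost_25))   # rounds to the nearest integer

print(int_pow([2, 3, 5], [10, 4, 2]), truncated, rounded)
```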
* add fake init for the trainer, fix large memory hold in the trainer
* do not merge recv vars from a remote endpoint, test=develop
* add recv and save op, merge slice var in one op, save memory
* remove hsigmoid with pull sparse, test=develop
Add tests that use dy/dx to make sure the gradient values calculated by the control flow backward are correct. Also fix the bugs detected by those tests.
Fix bugs:
1. Unlike sum_op, optimizer ops don't allow uninitialized input tensors. In conditional_block_grad_op, the conditional_block may not run, so the output gradient tensor may be uninitialized, which causes an error in the optimizer op. There are two possible fixes: let optimizer ops support uninitialized inputs like sum_op does, or assign 0 to the uninitialized gradient when the conditional_block_grad_op doesn't run. There are 10+ optimizer ops. **To keep it simple, this PR just assigns 0 to the output gradient of the conditional_block_grad_op**. Whether optimizer ops can support uninitialized input tensors like sum_op is worth exploring further, because in theory skipping the assignment in conditional_block_grad_op would be faster.
2. Infer parameter shapes during append_backward. All our parameters are in the global block, which I didn't know before. When op_desc infers shapes in a sub-block, it may not know the shapes of the gradients of parameters whose shape information is in the global block. I fixed it by inferring the shapes of the gradients from the forward vars.
This PR also did some code clean up:
1. Print the var name when sgd_op catches shape error so that it is easier to debug
2. Fix a typo: dicta -> dict
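
The first fix can be pictured with a toy model. If the backward of a skipped branch produced no gradient, a strict optimizer step would fail, so the backward returns an explicit zero gradient instead. Everything in this sketch (function names, the `2 * param` branch, `None` standing in for an uninitialized tensor) is a simplification for illustration, not framework code.

```python
import numpy as np

def cond_block_backward(ran, param, dout):
    """Toy backward of a branch that computes out = 2 * param."""
    if ran:
        return 2.0 * dout
    # The branch was skipped: return zeros rather than leaving the
    # gradient uninitialized (which would be None in this toy model).
    return np.zeros_like(param)

def sgd_step(param, grad, lr=0.1):
    """Toy optimizer step that, like the real ops, rejects bad inputs."""
    if grad is None:
        raise ValueError("uninitialized gradient")
    if grad.shape != param.shape:
        # Mirrors the clean-up item: report shapes to ease debugging
        raise ValueError(f"shape mismatch: param {param.shape}, grad {grad.shape}")
    return param - lr * grad

w = np.ones(3)
skipped = sgd_step(w, cond_block_backward(False, w, np.ones(3)))
taken = sgd_step(w, cond_block_backward(True, w, np.ones(3)))
print(skipped, taken)
```

With the zero gradient the skipped branch leaves the parameter unchanged, at the cost of one extra assignment.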
* fix the supported devices of the unique and unique_with_counts ops.
test=develop
test=document_fix
* Fix the test precision of the unique and unique_with_counts ops.
test=develop
test=document_fix
* add param & grad shape check for sgd op
* add _reshape_inplece interface for dygraph parallel
* refine unittest based paddle/models scripts, test=develop
* add unittest for parallel grad fuse, test=develop
* Commit before merging develop
test=develop
* Backup after working with Huihuang logs
* Commit before deleting Huihuang debug loggings
* Commit before debug
test=develop
* Fix bug commit
test=develop
* Backup of fixing bugs
test=develop
* Clean up code
test=develop
* Fix a bug in sum_op
test=develop
* Add ascending for argsort
* Refine api doc description.
* Refine descending description
* Add int32 logic to speed up when the data size is small.
* Remove int32 optimization as it is not supported in Python
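
A small NumPy sketch of what an ascending/descending switch on argsort means. The `descending` parameter name is an assumption for illustration. Note that this sketch reverses the ascending order, so tied elements come out in reversed order, which may differ from the real kernel.

```python
import numpy as np

def argsort(x, axis=-1, descending=False):
    """Return (sorted values, indices) along the given axis."""
    x = np.asarray(x)
    ids = np.argsort(x, axis=axis, kind="stable")
    if descending:
        # Reverse the ascending order to get a descending one
        ids = np.flip(ids, axis=axis)
    return np.take_along_axis(x, ids, axis=axis), ids

data = np.array([3, 1, 2])
asc_vals, asc_ids = argsort(data)
desc_vals, desc_ids = argsort(data, descending=True)
print(asc_ids, desc_ids)
```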
* Implement Int8 FC
* Integrate FC into INT8v2
test=develop
* int8 FC: transpose weights before computing scales
test=develop
* Add support for activation_type string in FC
test=develop
* Disable MKL-DNN's FC in VGG16 and 19
test=develop
* Disable FC quantization when mkldnn FC is disabled
test=develop
* Solve PADDLE_ENFORCES in FC int8
* Fix Paddle enforces and remove const cast
test=develop
* Fix style changes
test=develop
* Fix quantizer_tester test and add fc quantization
test=develop
* Fix FC test fail on CUDA
* Remove unnecessary log from quantize placement pass
test=develop
* Add Thread ID to FC hash key
test=develop
* Add comments to MKL-DNN FC Kernel
test=develop
* Refactor quantizer
test=develop
* Fix linter issues
test=develop
* Fix crash in slim googlenet
test=develop
* Fix PADDLE_ENFORCE messages
test=develop
* Add fc padding to solve mkl performance
test=develop
* fix gpu pass and error information
test=develop
* fix fc_fuse_pass_test
test=develop
* fix error information
test=develop
* fix name and add fc op padding test
test=develop
* fix attributes
test=develop
* optimize fc padding
test=develop
* fix test
test=develop
* Refactor MKL-DNN ElementwiseMul
remove manual fallback, remove format attrs
test=develop
* Refine PADDLE_ENFORCEs in eltwise_mul_op.h
test=develop
* Make ElementwiseMulOp inherit from ElementwiseOp
* Change type of simd_width to int
test=develop
* Remove Constructor extensions in ElementwiseOp and ElementwiseMulOp
test=develop
* Restore attributes
test=develop
* Fix test coverage for mkldnn eltwise mul
test=develop
* Conform to new is_run_common_broadcast API
test=develop
* Add UT for AreDimsAndFormatCorrect
test=develop
* Improve argsort performance.
- Running argsort on 200000 elements on a V100
is ~190x faster
before optimization: 0.53s
after optimization: 0.0027s
- Add fp16 support
* Refine error message
* Refine code
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>
* add int8 kernel to lookup_table op and add dequantize op test=develop
* change paddle_enforce to paddle_enforce_eq test=develop
* change copyright and fix some unsuitable code test=develop
* remove debug log test=develop
* replace GetInputType with IndicateVarDataType test=develop
* fix EmptyGradMaker test=develop
* fix diff between cpu and gpu test=develop
* use memcopy when int8_t test=develop
* open dygraph op test, test=develop
* modify to_variable, test=develop
* modify input and output for dygraph, test=develop
* modify input and output for dygraph(fix bug), test=develop
* fix input processing of dygraph op test, test=develop
* fix bug, test=develop
* fix op test, test=develop
* fix forward bug for dygraph, test=develop
* fix mkldnn op test for forward, test=develop
* update nn.py for dygraph, test=develop
* fix crop_tensor_op, test=develop
* fix elementwise_mul_op, test=develop
* fix fill_op, test=develop
* fix some mkldnn op, test=develop
* open backward op test for dygraph, test=develop
* delete log, test=develop
* close backward op test for dygraph, test=develop
* fix bug for edit_distance_op and test_lstm_cudnn_op, test=develop
* fix optest backward bug for dygraph, test=develop
* close backward op test for dygraph, test=develop
* open dygraph op test, test=develop
* fix op test for dygraph, fix GradOpDescMaker, test=develop
* fix bug for linear_chain_crf_op.h, test=develop
* remove log, test=develop
* remove log for op_test.py, test=develop
* fix bug for var_conv_2d_op, change PADDLE_ENFORCE, test=develop
* fix PADDLE_ENFORCE_EQ for hierarchical_sigmoid_op.cc, test=develop
* fix bug for test_increment_ngraph_op.py, test=develop
* fix lod for op test in dygraph, test=develop
* refactor op_test.py to reduce redundant code, test=develop
* fix lod optest, modify InputVar/OutputVar to HasInput/HasOutput, test=develop
* remove debug log, test=develop
* remove redundant code in base.py, test=develop
* fix some error in optest, test=develop
* fix ClearNoNeedBufferInputs function's bug for LoDTensor, test=develop
* refactor op_test.py, test=develop
* remove redundant writing, test=develop
* fix error(get tensor of the grad variable), test=develop
* fix test_concat_mkldnn test_conv2d_mkldnn, test=develop
* fix optest.py for get tensor of LoDTensor, test=develop
* fix some redundant code, test=develop
* resolve conflict and rewrite paddle error message, test=develop
* fix auc drop first commit test=develop
* update datanorm op
* update datanorm with enforce test=develop
* update test=develop
* update format test=develop
* update format
* update format test=develop
* add unit test test=develop
* update unit test test=develop
* update format test=develop
* update format test=develop
* update API description test=develop
* update API description test=develop
* update format test=develop
* fix code according to comments test=develop
* fix description according to comments test=develop
* fix description according to comments test=develop
* update codes.. test=develop
* modified error message for conv and conv_transpose, test=develop
* modified doc of conv and conv_transpose op, test=develop
* modified the expression for error message, test=develop
* modified error message for group_norm op, test=develop
* modified detail of Attr(data_format) or Attr(data_layout)
* add ValueError in API doc for maxout op, test=develop
* Add asymmetric padding for conv fusion.
test=develop
reference: pr/20042
* Fix eigen build link error
* Change back file mode
* Use math function & add more checks.
* set the default value of alpha for prelu to 0.25, test=develop
* add the call to __syncthreads(), test=develop
* fix the implementation of cpu prelu, test=develop
* repair the implementation of element mode prelu, test=develop
* modify test_prelu_op.py, test=develop
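
For reference, PReLU computes x for positive inputs and alpha * x otherwise. The modes differ only in the shape of alpha. The NumPy sketch below relies on broadcasting and assumes an NCHW layout, which is my choice for illustration. The 0.25 default matches the commit above, and the rest is not taken from the kernels.

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU: x where x > 0, alpha * x elsewhere (alpha is broadcast)."""
    return np.where(x > 0, x, alpha * x)

# Input of shape (N=1, C=2, H=1, W=2)
x = np.array([[[[-4.0, 2.0]], [[-8.0, 1.0]]]])

# "all" mode: a single alpha shared by every element
out_all = prelu(x)

# "channel" mode: one alpha per channel, shaped (1, C, 1, 1)
alpha_channel = np.array([0.25, 0.5]).reshape(1, 2, 1, 1)
out_channel = prelu(x, alpha_channel)

# "element" mode: one alpha per element of a sample, shaped (1, C, H, W)
alpha_element = np.full((1, 2, 1, 2), 0.1)
out_element = prelu(x, alpha_element)
```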
* Add the check of lod_level between compile-time and runtime.
test=develop
* Fix bug in check_compile_vs_runtime.
test=develop
* Fix the check of output when it is dispensable or intermediate.
test=develop
* Share lod of x to out in match_matrix_tensor op in compile-time.
* Implement GetLoDLevel in InferShapeContext.
* Set the default value of check_compile_vs_runtime to False and enable it in test_sequence_pad_op.
test=develop
* Enable check_compile_vs_runtime in test_match_matrix_tensor.
* Add the implementation of SetLoDLevel in InferShapeContext.
* Remove the implementation of IncreaseLoDLevel and call Get/SetLoDLevel instead.
* Remove the implementation of DecreaseLoDLevel and call Set/GetLoDLevel instead.
* Refine some ops and unittests.
test=develop
* Fix a typo.
test=develop
* Remove the check of var type, and change int to int32_t.
test=develop
* Add unittest for Get/SetLoDLevel.
test=develop
* fix bug in pool/conv/conv_transpose:
1. It should be stride[i] not stride[0] in UpdatePaddingAndDilation;
2. fix bug of func _get_padding_with_SAME in test_conv/conv_transpose_op.py;
3. fix bug of the computation process in function conv2dtranspose_forward_naive.
test=develop
* change test to make the data of different dimensions different. test=develop
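
To show why the index matters in item 1, here is a sketch of a "SAME" padding computation that uses the stride of each dimension. The formula is the usual one (output size is ceil(input / stride), then pad to cover the kernel). The function name and tuple layout are assumptions, not the signature of UpdatePaddingAndDilation.

```python
def same_padding(input_size, kernel_size, strides):
    """Per-dimension (pad_before, pad_after) for SAME padding."""
    pads = []
    for i in range(len(input_size)):
        # Use strides[i] here; strides[0] is wrong when strides differ
        out_size = (input_size[i] + strides[i] - 1) // strides[i]
        pad_sum = max((out_size - 1) * strides[i] + kernel_size[i] - input_size[i], 0)
        pad_before = pad_sum // 2
        pad_after = pad_sum - pad_before
        pads.append((pad_before, pad_after))
    return pads

pads = same_padding(input_size=(7, 8), kernel_size=(3, 3), strides=(1, 2))
print(pads)
```

With strides (1, 2), using the first stride for the second dimension would give (1, 1) instead of (0, 1) there, which is also why the tests now use different values per dimension.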