* basic implementation of static model runner, test=develop
* add run program op to execute loaded program, test=develop
* refactor static model runner & run program op, test=develop
* reset engine.cc to resolve conflict
* adapt the change of dygraph double grad, test=develop
* refactor impl to solve control flow error, test=develop
* clear debug code, test=develop
* fix ci str compatible error & checkout dygraph grad maker & add example, test=develop
* hide api & add op test, test=develop
* fix run program op test places error, test=develop
* fix program by review comment, test=develop
* delete change var desc name, test=develop
* fix other program by review comment, test=develop
* remove _static_graph_guard, test=develop
* add selectedrows test, test=develop
* remove desc parser, test=develop
* fix detail program, test=develop
* change scope creation & add test, test=develop
* sequential reader stage 1, test=develop
* fix ut, test=develop
* fix iterable=False reset bug, add some logs and polish code, test=develop
* inference feed partial data, test=develop
* Turn on keep_order=True for test, test=develop
* enhance ut to test more cases, test=develop
* test commit for reverting
* Revert "test commit for reverting", test=develop
This reverts commit 80aef42ef52ba1ee79627d6f663a624ec4f12f58.
* add ut of merged and unmerged results, test=develop
* add more uts for coverage and add en doc of api, test=develop
* follow comments, test=develop
* change note style, test=develop
* update ScopeBufferedSSAGraphExecutor, AsyncSSAGraphExecutor, ThreadedSSAGraphExecutor, FastThreadedSSAGraphExecutor, ParallelSSAGraphExecutor and ParallelExecutor for fetching unmerged results.
* add the unit test for fetch_unmerged.
* update ut for multi-card and multi-cpu.
* add the error message and the user suggestion in FetchOpHandle. test=develop
* Add two types of Metric Calculator: MultiTaskCalculator & CmatchRankCalculator.
* Add a config for the DynamicAdjustChannelNum function to denote whether we will discard the remaining instances when they cannot be distributed evenly.
* Remove CPU code in Pull/PushSparse and we will add it back when testing it fully.
* Fix some known issues, such as copying persistable vars after one epoch of running.
Refine PaddleBox Framework. Main functions:
* Add MetricMsg util class, which can calculate metrics like AUC, bucket_error, COPC.
* Replace FeedPass with new interface: BeginFeedPass & EndFeedPass
* Refactor Pull/Push Sparse Function in box_wrapper.
* Use CUDA Kernel to copy keys and copy feasign between tensor and boxps struct.
* Cache copied keys in pull sparse in order to reuse it in push period.
* Add the first implementation of fusion_group op #19621 (#3)
* Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc.
test=develop
* Call CUDA driver api to launch the kernel compiled by nvrtc.
test=develop
* Disable for mac and windows.
test=develop
* Refine the codes to support manually specified num_threads and workload_per_thread.
test=develop
* Refine the CUDA kernel to support large dims.
test=develop
* Add DeviceCodePool to manage all device codes.
* Add the first implementation of the fusion_group op.
* Add unit-test for fusion_group op.
* Add the check of result.
* Add the check of nvrtc in unit-test.
test=develop
* Add comment to explain the inputs, outputs and features of fusion_group op.
test=develop
* Disable fusion_group op for mac and windows.
test=develop
* Make the compiling of device code return status instead of hanging up.
test=develop
* Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API.
* Unify fusion_group_op's input and output names.
test=develop
* Add the check of CUDA driver library in unittest.
test=develop
* Enable generating code for a given subgraph. #21126 (#4)
* Enable generating code for a given subgraph.
* Support sorting the subgraph.
* Remove the rearrangement of expressions because we use the sorted subgraph directly.
* Enable generating code for a subgraph which is composed of grad ops.
* Use expression information to check the accuracy in unittest.
* Separate load and store from computation expressions.
test=develop
* Improve the loading statements in generated codes.
test=develop
* Remove unused arguments from formal list.
test=develop
* Enable the detection of subgraph of grad ops.
* Generate code for detected subgraph in fusion_group_pass.
* Add an option in BuildStrategy to enable fusion_group_pass and add unittest.
test=develop
* Fix a bug when checking whether the shapes of all inputs are the same.
* Add debug information.
* Remove subgraph_detector from inference/analysis to the common framework/ir directory. (#5)
test=develop
* Call subgraph_detector in fusion_group pass.
test=develop
* Disable fusion_group when WITH_GPU is OFF.
test=develop
* Refine all PADDLE_ENFORCE message.
test=develop
* Fix the case that some inputs are not defined in grad ops, and set op_role for fused op.
test=develop
* Follow review comments.
test=develop
* Enable quantize to reorder to nchw as well
* Correct FC MKL-DNN input dim requirements to accept 3D
* Improve DNNL FC format, error and 3D input handling
test=develop
* Improve error checking in FC
test=develop
* Improve PADDLE_ENFORCE messages in fc-related files
* Remove data layout attribute from obligatory pass args
test=develop
* Fix message in fc_mkldnn_pass to be logically correct
test=develop
* Implement a common python unittest to test the ir passes.
test=develop
* Save the results in np.array and support starting up on CPU.
test=develop
* Fix the unittest.
test=develop
* Add check_program to check whether the optimized program is different from the origin one.
test=develop
* Remove the interface all_ops.
test=develop
* Add exception test in pass_test.
test=develop
* add bn and relu fuse pass
* add op attr assert and dtype assert
* fix some input and output bugs for the fused op and pattern.
* add the unittest for fuse_bn_act_pass. test=develop
* use normative enforce statements. test=develop
* add the cpu test. test=develop
* add the support of batch_size=1 for the bn with relu op. test=develop
* add the error type for paddle throws. test=develop
* add fused_batch_norm_act and fused_batch_norm_act_grad to op_has_unused_vars_white_list. test=develop
* Polish the PADDLE_ENFORCE in fusion_group pass related codes.
test=develop
* Correct the unittest because of the change of relu_grad's formula.
test=develop
* Refine the calling of PADDLE_ENFORCE.
test=develop
* optimize adam speed by removing _finish_update test=develop
* fix SparseAdamFunctor param list test=develop
* Remove scale_op in expect_list of adam_op test=develop
* fix test optimizer loss assert error test=develop
* modify PADDLE_ENFORCE usage test=develop
* fix op_type in lamb_op.cc test=develop
* fix ostream format bug in error messages test=develop
* add betaPowOut in ngraph op test=develop
* fix ngraph::op api for gcc8 test=develop
* clean code test=develop
* modify struct into class test=develop
* remove code of beta1Tensor in lamb_op test=develop
* fc-dequantize squash
test=develop
* change according to reviews
test=develop
* change PADDLE_ENFORCE
test=develop
* add second test when fc-dequant does not fuse
test=develop
* change all related PADDLE_ENFORCE
test=develop
* add fake init for the trainer, fix large memory footprint in the trainer
* do not merge recv vars from a remote endpoint, test=develop
* add recv and save op, merge slice var in one op, save memory
* remove hsigmoid with pull sparse, test=develop
* add file check_op_desc.py and add interface to get default value. test=develop
* add test for c++ coverage rate. test=develop
* Correct typo. test=develop
* Commit before merging develop
test=develop
* Backup after working with Huihuang logs
* Commit before deleting Huihuang's debug logging
* Commit before debug
test=develop
* Fix bug commit
test=develop
* Backup of fixing bugs
test=develop
* Clean up code
test=develop
* Fix a bug in sum_op
test=develop
* Implement Int8 FC
* Integrate FC into INT8v2
test=develop
* int8 FC: transpose weights before computing scales
test=develop
* Add support for activation_type string in FC
test=develop
* Disable MKL-DNN's FC in VGG16 and 19
test=develop
* Disable FC quantization when mkldnn FC is disabled
test=develop
* Solve PADDLE_ENFORCES in FC int8
* Fix Paddle enforces and remove const cast
test=develop
* Fix style changes
test=develop
* Fix quantizer_tester test and add fc quantization
test=develop
* Fix FC test fail on CUDA
* Remove unnecessary log from quantize placement pass
test=develop
* Add Thread ID to FC hash key
test=develop
* Add comments to MKL-DNN FC Kernel
test=develop
* Refactor quantizer
test=develop
* Fix linter issues
test=develop
* Fix crash in slim googlenet
test=develop
* Fix PADDLE_ENFORCE messages
test=develop
* Add fc padding to solve the mkl performance issue
test=develop
* fix gpu pass and error information
test=develop
* fix fc_fuse_pass_test
test=develop
* fix error information
test=develop
* fix name and add fc op padding test
test=develop
* fix attributes
test=develop
* optimize fc padding
test=develop
* fix test
test=develop
* fix fetch handler problem and refactor
when a user defines a FetchHandler class, he or she should initialize the handler
with a variable dict. The key of the variable dict is a user-defined name,
and the value is a Variable generated from the Python API.
For each fetch, the user should implement the handler function, in which
fetched_result_dict will be available, and the user can access the fetched
values with the user-defined keys. A minimal sketch follows.
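A minimal sketch of this usage, assuming FetchHandler is exposed under paddle.fluid.executor and that the handler method receives the fetched results as a dict (the class location, constructor argument, and method signature here are assumptions, not the verified API):

    import paddle.fluid as fluid

    # Hypothetical subclass following the description above.
    class LossFetchHandler(fluid.executor.FetchHandler):
        def handler(self, fetched_result_dict):
            # keys are the user-defined names from the variable dict
            print("current loss:", fetched_result_dict["loss"])

    # Usage (illustrative): map user-defined names to Variables built via
    # the Python API, e.g. loss_var below is such a Variable.
    # handler = LossFetchHandler(var_dict={"loss": loss_var})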
* Disable fusion_group pass for windows and mac. We will do some experiments on Linux first.
test=develop
* Print the subgraph when check failed.
test=develop
* copy some feasigns and corresponding embeddings from one sparse table to another
* copy all feasigns and corresponding embeddings from one sparse table to another
* copy all dense params from one table to another
* copy some local vars to other local vars
* Add the check of lod_level between compile-time and runtime.
test=develop
* Fix bug in check_compile_vs_runtime.
test=develop
* Fix the check of output when it is dispensable or intermediate.
test=develop
* Share lod of x to out in match_matrix_tensor op in compile-time.
* Implement GetLoDLevel in InferShapeContext.
* Set the default value of check_compile_vs_runtime to False and enable it in test_sequence_pad_op.
test=develop
* Enable check_compile_vs_runtime in test_match_matrix_tensor.
* Add the implementation of SetLoDLevel in InferShapeContext.
* Remove the implementation of IncreaseLoDLevel and call Get/SetLoDLevel instead.
* Remove the implementation of DecreaseLoDLevel and call Set/GetLoDLevel instead.
* Refine some ops and unittests.
test=develop
* Fix a typo.
test=develop
* Remove the check of var type, and change int to int32_t.
test=develop
* Add unittest for Get/SetLoDLevel.
test=develop
* Add the definition of operation in fusion_group.
* Use operations in OperationMap to detect fusion_group of elementwise pattern.
* Add namespace fusion_group in code_generator.
* Use operations recorded in OperationMap to generate code.
* Remove implementation codes to .cc file.
* Refine Operation and CodeGenerator to make it easier to generate code for grad_op.
Refine the unittest for better reuse.
* Avoid recording the template's keyword in an array.
* Support the generating of code for grad_op and add unittest.
test=develop
* Remove replaced_element_in_order and use the use number instead.
test=develop
* Enrich the type of error and declare the error type interfaces, test=develop
* adjust tests to adapt new form, test=develop
* add inference deps with error_codes.pb.h, test=develop
* restore stack iter start pos, test=develop
* polish code based review comments, test=develop
* remove duplicate code and duplicate config of master+patch
* drop all ins which have a conflict slot or whose size < merge_size
* the user only needs to set merge_size; if the number of ins with the same id is not equal to merge_size, these ins are just dropped
* the user must make sure that master data and patch data have no common slot whose feasigns are both non-zero; otherwise these ins will be dropped (the slot list should still be the same for both master and patch). A sketch of the drop rules follows.
* test=develop
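A minimal Python sketch of the drop rules described above (the real implementation is in C++; the record layout, field names, and conflict test here are illustrative assumptions):

    from collections import defaultdict

    def merge_master_patch(ins_list, merge_size):
        # group instances (master + patch records) by their id
        groups = defaultdict(list)
        for ins in ins_list:
            groups[ins["id"]].append(ins)

        merged = []
        for ins_id, group in groups.items():
            # drop: the number of ins with this id is not equal to merge_size
            if len(group) != merge_size:
                continue
            # drop: a conflict slot, i.e. some slot has non-zero feasigns
            # in more than one record of the group
            seen_slots, conflict = set(), False
            for ins in group:
                for slot, feasigns in ins["slots"].items():
                    if any(f != 0 for f in feasigns):
                        if slot in seen_slots:
                            conflict = True
                        seen_slots.add(slot)
            if not conflict:
                merged.append(group)
        return merged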
* support no need buffer vars in dygraph, test=develop
* fix inference compilation error, test=develop
* update no_need_buffer_vars_inference, test=develop
* add unittests for no_need_buffer_vars_context, test=develop
* refine no_need_buffer_vars by return ref, test=develop
* polish some codes, test=develop
* Refine the cache of program, context and scope in executor.
test=develop
* Refine the unittest test_executor_and_use_program_cache.
* Add the test the PaddingRNN with use_program_cache=True.
test=develop
* Remove a check.
test=develop
* Refine the unittest to check whether it is correct when setting use_program_cache=True.
test=develop
* Refine the InferShape of ReadFrom and WriteTo op, and add comment to explain why not call ShareLoD for runtime.
test=develop
* Add comment for ReorderLoDTensorByRank op.
* Add comment for lod_tensor_to_tensor_array op to explain why only call DecreaseLoDLevel for compile time.
test=develop
* ShrinkRNNMemory op should call ShareLoD for compile time.
test=develop
* Add the implementation of IncreaseLoDLevel and add the compile-time check of lod_level in InferShape of sequence_pool.
test=develop
* Refine the unittest of DynamicRNN.
test=develop
* Change PADDLE_ENFORCE to PADDLE_ENFORCE_NE.
test=develop
* Add fusion_group_pass and elementwise pattern.
* Rewrite the detector of elementwise group.
test=develop
* Add a comment in codegen.
* Add more unittest cases.
test=develop
* Move code_generator related code to fusion_group directory.
* Correct the including path.
* Add the definition of SubGraph and finish the insert of fusion_group op in pass.
* Insert graph_vis_pass in tester to visualize the graph for debug.
* replace part of the old implementation, test=develop
* restore concat op, test=develop
* update all ops implementation & delete GetDataTypeOfVar func, test=develop
* it is no longer necessary to define the embedding layers of all slots (not one less) in each program; make trainer_param repeated in ps.proto.
* add find_distributed_lookup_table_grads instead of hard-coding GRAD
* support embedding stop gradient; push sparse had an error before this fix.
* fix fill sparse and skip slots which do not have an embedding; before this fix, each slot's embedding in a sparse table had to be used in all training programs.
* fix pull sparse, skipping slots which do not have an embedding.
* fix collecting feasign label info, skipping slots which do not have an embedding.
* support the case where there are multiple sparse tables in one or more training programs; each program can pull/push its own related sparse tables instead of all sparse tables.
* test=develop
* - Flushing mkl-dnn cache
test=develop
- Disabled clearing cache for LoadModel
- Added clearing of mkl-dnn cache when Executor is created
test=develop
- Do not clear for GPU places
test=develop
- compilation fix
test=develop
* - Moved clearing of mkl-dnn cache into the destructor of executor
test=develop
* - Compilation fix
test=develop
- Reverted conditional clearing of mkl-dnn cache in Executor's
destructor
test=develop
- compilation fix
* Writing a custom op needs to follow the framework OP spec.
* Package fluid_framework.so and headers into whl.
* Add paddle.sysconfig.get_include() and paddle.sysconfig.get_lib() to get the include dir and lib dir (a usage sketch follows after this list).
* Export some C-APIs to merge OpInfo between core.so and custom_op.so.
* Add unit testing.
* Update API.spec.
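A minimal sketch of how the two sysconfig helpers might be used when building a custom op against the installed wheel (the compile line in the comment is illustrative, not a verified build command):

    import paddle

    include_dir = paddle.sysconfig.get_include()  # headers packaged in the whl
    lib_dir = paddle.sysconfig.get_lib()          # directory containing fluid_framework.so

    # Illustrative compile line for building custom_op.so:
    #   g++ custom_op.cc -o custom_op.so -shared -fPIC -std=c++11 \
    #       -I<include_dir> -L<lib_dir>
    print(include_dir, lib_dir)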
* Follow Wangzhen's comment in PR 18970, test=develop
* Review comments, test=develop
* Leave fake quantization around mul
test=develop
* Replace Fake with Real Quantized Mul
test=develop
* Fix bug in quantize placement pass
Nodes in the graph are now checked by type instead of by node name when they are to be marked for quantization. test=develop
The new "fluid.data" changes old "fluid.layers.data":
1. Add shape and dtype check.
2. Remove "append_batch_size" parameter. We won't offer this in the new data layer because other deep learning platforms don't have this kind of data layer pre-processing. It may confuse users.
3. Remove "stop gradient" parameter because the data layer doesn't do back-propagation
TODO:
Now data layer feeded by executor is checked, will we want to check the feed data of readers in the future?
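A minimal sketch of the new API as described above (layer sizes and variable names are illustrative, and exact argument handling may differ by version):

    import numpy as np
    import paddle.fluid as fluid

    # shape and dtype are declared explicitly; no batch dim is prepended
    x = fluid.data(name='x', shape=[None, 784], dtype='float32')
    y = fluid.layers.fc(input=x, size=10)

    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    # feeding an array with a mismatched shape or dtype is now rejected
    out, = exe.run(feed={'x': np.random.rand(8, 784).astype('float32')},
                   fetch_list=[y])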
* support change shuffle thread num
* support change train thread num
* fix receiving of shuffle data for each channel
* data norm stop gradient
* add check of thread_tensor type and root_tensor type when merging metrics
* remove sleep in shuffle, add config
* add config of pslib client to client communication
* fix xbox str
* add data norm op testcase
* add flush in trainer finalize
* Set states of recurrent op as dependent vars in prune of save inference model
This PR fixes the save/load inference model problem of RNN models.
The reason for the bug is that save_inference_model prunes OPs that don't contribute to the Output. But in recurrent_op, States are not Output, so OPs that refer to States will be pruned.
This fix adds the States of recurrent_op as dependent vars so that OPs referring to States won't be pruned.