* fix fetch handler problem and refactor
When a user defines a FetchHandler class, he or she should initialize the handler
with a variable dict. The keys of the variable dict are user-defined names, and
the values are Variables generated from the Python API.
On each fetch, the user-implemented handler function is called; inside it,
fetched_result_dict is available, and the user can access the fetched values
by the user-defined keys.
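A minimal sketch of the intended usage, assuming FetchHandler is exposed under paddle.fluid.executor and that the keyword names (var_dict, fetched_result_dict) follow the description above:

```python
import paddle.fluid as fluid

# build a Variable with the Python API; "loss" is the user-defined key
data = fluid.layers.data(name="x", shape=[1], dtype="float32")
loss_var = fluid.layers.mean(data)

class MyFetchHandler(fluid.executor.FetchHandler):
    def handler(self, fetched_result_dict):
        # look up the fetched value by the user-defined key
        print(fetched_result_dict["loss"])

handler = MyFetchHandler(var_dict={"loss": loss_var})
```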
* Disable fusion_group pass for Windows and Mac. We will do some experiments on Linux first.
test=develop
* Print the subgraph when check failed.
test=develop
* Enable generating code for a given subgraph.
* Support sorting the subgraph.
* Remove the rearrangement of expressions because we use the sorted subgraph directly.
* Enable generating code for a subgraph which is composed of grad ops.
* Use expression information to check the accuracy in unittest.
* Separate load and store from computation expressions.
test=develop
* Improve the loading statements in generated codes.
test=develop
* Remove unused arguments from formal list.
test=develop
* copy some feasigns and corresponding embeddings from one sparse table to another
* copy all feasigns and corresponding embeddings from one sparse table to another
* copy all dense params from one table to another
* copy some local vars to other local vars
* Add the check of lod_level between compile-time and runtime.
test=develop
* Fix bug in check_compile_vs_runtime.
test=develop
* Fix the check of output when it is dispensable or intermediate.
test=develop
* Share lod of x to out in match_matrix_tensor op in compile-time.
* Implement GetLoDLevel in InferShapeContext.
* Set the default value of check_compile_vs_runtime to False and enable it in test_sequence_pad_op.
test=develop
* Enable check_compile_vs_runtime in test_match_matrix_tensor.
* Add the implementation of SetLoDLevel in InferShapeContext.
* Remove the implementation of IncreaseLoDLevel and call Get/SetLoDLevel instead.
* Remove the implementation of DecreaseLoDLevel and call Set/GetLoDLevel instead.
* Refine some ops and unittests.
test=develop
* Fix a typo.
test=develop
* Remove the check of var type, and change int to int32_t.
test=develop
* Add unittest for Get/SetLoDLevel.
test=develop
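For context, a short illustration of what lod_level means at runtime (standard fluid API; the values are arbitrary):

```python
import numpy as np
import paddle.fluid as fluid

# pack a batch of 3 sequences (lengths 2, 1, 3) into one LoDTensor,
# which gives a runtime lod_level of 1
data = np.arange(6).reshape(6, 1).astype("float32")
t = fluid.create_lod_tensor(data, [[2, 1, 3]], fluid.CPUPlace())
print(len(t.lod()))  # 1; the new check compares this with the compile-time lod_level
```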
* Add the definition of operation in fusion_group.
* Use operations in OperationMap to detect fusion_group of elementwise pattern.
* Add namespace fusion_group in code_generator.
* Use operations recorded in OperationMap to generate code.
* Move implementation codes to .cc file.
* Refine Operation and CodeGenerator to make it easier to generate code for grad_op.
Refine the unittest for better reuse.
* Avoid recording the template's keyword in an array.
* Support the generating of code for grad_op and add unittest.
test=develop
* Remove replaced_element_in_order and use number instead.
test=develop
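To make the OperationMap-driven generation concrete, here is a toy sketch (not the actual implementation) of expression-template substitution:

```python
# hypothetical map from op type to a C-like expression template
OPERATIONS = {
    "elementwise_add": "${0} + ${1}",
    "relu": "${0} > 0 ? ${0} : 0",
}

def generate_expression(op_type, args):
    """Substitute argument names into the recorded expression template."""
    expr = OPERATIONS[op_type]
    for i, name in enumerate(args):
        expr = expr.replace("${%d}" % i, name)
    return expr

# relu(x + y) -> "(x + y) > 0 ? (x + y) : 0"
tmp = "(%s)" % generate_expression("elementwise_add", ["x", "y"])
print(generate_expression("relu", [tmp]))
```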
* Enrich the type of error and declare the error type interfaces, test=develop
* adjust tests to adapt to the new form, test=develop
* add inference deps with error_codes.pb.h, test=develop
* restore stack iter start pos, test=develop
* polish code based review comments, test=develop
* remove duplicate code and duplicate config of master+patch
* drop all ins which have a conflicting slot or size < merge_size
* the user only needs to set merge_size; if the number of ins with the same id is not equal to merge_size, those ins are simply dropped
* the user must make sure master data and patch data have no same slot whose feasigns are both non-zero, otherwise those ins will be dropped (the slot list should still be the same for both master and patch); see the sketch after this block
* test=develop
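A toy sketch of the drop rules above (the data layout is hypothetical: each ins is a dict mapping slot to its feasign list, and a group holds all ins sharing one id):

```python
def should_keep(group, merge_size):
    """Keep a group only if it has exactly merge_size ins and no slot
    has non-zero feasigns in more than one ins (no conflicting slot)."""
    if len(group) != merge_size:
        return False
    for slot in group[0]:  # slot lists of master and patch must match
        if sum(1 for ins in group if ins.get(slot)) > 1:
            return False
    return True

# master and patch contribute feasigns to disjoint slots -> kept
master = {"slot_a": [101, 102], "slot_b": []}
patch = {"slot_a": [], "slot_b": [201]}
print(should_keep([master, patch], merge_size=2))  # True
```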
* support no need buffer vars in dygraph, test=develop
* fix inference compilation error, test=develop
* update no_need_buffer_vars_inference, test=develop
* add unittests for no_need_buffer_vars_context, test=develop
* refine no_need_buffer_vars by return ref, test=develop
* polish some codes, test=develop
* Refine the cache of program, context and scope in executor.
test=develop
* Refine the unittest test_executor_and_use_program_cache.
* Add the test the PaddingRNN with use_program_cache=True.
test=develop
* Remove a check.
test=develop
* Refine the unittest to check whether it is correct when setting use_program_cache=True.
test=develop
* Refine the InferShape of ReadFrom and WriteTo op, and add a comment to explain why ShareLoD is not called for runtime.
test=develop
* Add a comment for ReorderLoDTensorByRank op.
* Add a comment for lod_tensor_to_tensor_array op to explain why DecreaseLoDLevel is called only for compile time.
test=develop
* ShrinkRNNMemory op should call ShareLoD for compile time.
test=develop
* Add the implementation of IncreaseLoDLevel and add the compile-time check of lod_level in InferShape of sequence_pool.
test=develop
* Refine the unittest of DynamicRNN.
test=develop
* Change PADDLE_ENFORCE to PADDLE_ENFORCE_NE.
test=develop
* Add fusion_group_pass and elementwise pattern.
* Rewrite the detector of elementwise group.
test=develop
* Add a comment in codegen.
* Add more unittest cases.
test=develop
* Move code_generator related code to fusion_group directory.
* Correct the include path.
* Add the definition of SubGraph and finish the insert of fusion_group op in pass.
* Insert graph_vis_pass in tester to visualize the graph for debug.
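A toy sketch of detecting elementwise groups, assuming a simplified graph representation (op name -> (op type, downstream op names)) and traversal in the downstream direction only:

```python
ELEMENTWISE = {"elementwise_add", "elementwise_mul", "relu", "sigmoid", "tanh"}

def detect_groups(graph):
    """Collect connected components whose op types are all elementwise."""
    seen, groups = set(), []
    for op, (op_type, _) in graph.items():
        if op in seen or op_type not in ELEMENTWISE:
            continue
        group, stack = [], [op]
        while stack:
            cur = stack.pop()
            if cur in seen or graph[cur][0] not in ELEMENTWISE:
                continue
            seen.add(cur)
            group.append(cur)
            stack.extend(graph[cur][1])
        groups.append(group)
    return groups

g = {"add": ("elementwise_add", ["act"]), "act": ("relu", ["mm"]),
     "mm": ("mul", [])}
print(detect_groups(g))  # [['add', 'act']]
```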
* replace part of the old implementation, test=develop
* restore concat op, test=develop
* update all ops' implementations & delete GetDataTypeOfVar func, test=develop
* no longer need to define the embedding layers of all slots (every single one) in each program; make trainer_param repeated in ps.proto.
* add find_distributed_lookup_table_grads instead of hard-coding GRAD
* support embedding stop gradient. push sparse had an error before this fix.
* fix fill sparse: skip slots which do not have embeddings. Before this fix, each slot's embedding in a sparse table had to be used in all training programs.
* fix pull sparse: skip slots which do not have embeddings.
* fix collecting feasign label info: skip slots which do not have embeddings.
* support multiple sparse tables in one or multiple training programs: each program can pull/push its own related sparse tables instead of all sparse tables.
* test=develop
* - Flushing mkl-dnn cache
test=develop
- Disabled clearing cache for LoadModel
- Added clearing of mkl-dnn cache when Executor is created
test=develop
- Do not clear for GPU places
test=develop
- compilation fix
test=develop
* - Moved clearing of mkl-dnn cache to the destructor of executor
test=develop
* - Compilation fix
test=develop
- Reverted conditional clearing of mkl-dnn cache in Executor's
destructor
test=develop
- compilation fix
* Writing a custom op needs to follow the framework OP spec.
* Package fluid_framework.so and headers into whl.
* Add paddle.sysconfig.get_include() and paddle.sysconfig.get_lib() to get include dir and lib dir.
* Export some C-APIs to merge OpInfo between core.so and custom_op.so.
* Add unit testing.
* Update API.spec.
* Follow Wangzhen's comment in PR 18970, test=develop
* Review comments, test=develop
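Usage of the two new helpers when compiling a custom op against an installed wheel:

```python
import paddle

# directories holding the framework headers and fluid_framework.so,
# to pass to the compiler/linker when building custom_op.so
print(paddle.sysconfig.get_include())
print(paddle.sysconfig.get_lib())
```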
* Leave fake quantization around mul
test=develop
* Replace Fake with Real Quantized Mul
test=develop
* Fix bug in quantize placement pass
Nodes in the graph are now checked by type instead of by node name when they are to be marked for quantization.
test=develop
The new "fluid.data" changes old "fluid.layers.data":
1. Add shape and dtype check.
2. Remove "append_batch_size" parameter. We won't offer this in the new data layer because other deep learning platforms don't have this kind of data layer pre-processing. It may confuse users.
3. Remove "stop gradient" parameter because the data layer doesn't do back-propagation
TODO:
Now data layer feeded by executor is checked, will we want to check the feed data of readers in the future?
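A minimal before/after sketch of the data-layer change (shape values are illustrative):

```python
import paddle.fluid as fluid

# old: fluid.layers.data prepends the batch dimension itself
# (append_batch_size=True) and does not validate the fed data
x_old = fluid.layers.data(name="x_old", shape=[784], dtype="float32")

# new: the full shape is written out (-1 for a variable batch size),
# and the shape/dtype of fed data are checked at feed time
x_new = fluid.data(name="x_new", shape=[-1, 784], dtype="float32")
```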
* support change shuffle thread num
* support change train thread num
* fix receive shuffle data of each channel
* data norm stop gradient
* add check thread_tensor type and root_tensor type when merge metric
* remove sleep in shuffle, add config
* add config of pslib client to client communication
* fix xbox str
* add data norm op testcase
* add flush in trainer finalize
* Set states of recurrent op as dependent vars in prune of save inference model
This PR fixes the save/load inference model problem of RNN models.
The reason for the bug is that save_inference_model prunes OPs that don't contribute to the Output. But in recurrent_op, States are not Outputs, so OPs that refer to States were pruned.
This fix adds the States of recurrent_op as dependent vars so that OPs referring to States won't be pruned.
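For context, the pruning entry point this fix touches (a runnable sketch with a trivial network; an RNN model would hit the States issue described above):

```python
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
x = fluid.data(name="x", shape=[-1, 4], dtype="float32")
pred = fluid.layers.fc(input=x, size=2)
exe.run(fluid.default_startup_program())

# prunes all OPs that do not contribute to target_vars; with the fix,
# OPs producing recurrent_op States are also kept as dependencies
fluid.io.save_inference_model(dirname="./infer_model",
                              feeded_var_names=["x"],
                              target_vars=[pred],
                              executor=exe)
```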
* Add fc_elementwise_layernorm_fuse pass and unittest.
* Add fused_fc_elementwise_layernorm op and its GPU kernel.
test=develop
* Apply fc_elementwise_layernorm_fuse_pass to GPU inference.
* Add the setting of attrs in the definition of binary_op.
test=develop
* Add comment.
* Implement the unittest.
test=develop
* Change the unittest name of layer_norm.
test=develop
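A NumPy sketch of the computation the fused op replaces (parameter names are illustrative, not the op's attribute names):

```python
import numpy as np

def fused_fc_elementwise_layernorm(x, w, b, y, gamma, beta, eps=1e-5):
    """layer_norm(fc(x) + y): the fc + elementwise_add + layer_norm
    pattern matched by fc_elementwise_layernorm_fuse_pass."""
    tmp = x @ w + b + y                      # fc followed by elementwise_add
    mean = tmp.mean(axis=-1, keepdims=True)  # layer_norm statistics
    var = tmp.var(axis=-1, keepdims=True)
    return (tmp - mean) / np.sqrt(var + eps) * gamma + beta
```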
* refactor dygraph,test=develop
* fix failed unittest,test=develop
* polish code,test=develop
* check windows ci error,test=develop
try to fix windows ci error by np.allclose,test=develop
* polish vlog and profiler, test=develop
* try to fix preceding ops order,test=develop
* test transformer in windows ci, test=develop
* use python c-api to speed up tracer.trace,test=develop
* test=develop, fix docker with paddle nccl problem
* test=develop, add ut for debug string and gradient_accumulator
* test=develop, add tests for layer/gradient_accumulator/prepared_op
* test=develop, fix compile error for test_prepared_op
* test=develop, add more ut for dygraph
* test=develop, create API.spec for dygraph api change
* add transform_data to dygraph
* test=develop, refactor names to make them easier to understand
* test=develop, refactor names to make them easier to understand
* add test and change input to const ref for safety
* test=develop, fix multi-gpu failed problem, add Tracer tests, change PADDLE_ENFORCE to PADDLE_ENFORCE_EQ
* add ut for data transform
* refine ut for data_transform
* test=develop, fix ut failed on parallel se-resnext
* test=develop, change one more PADDLE_ENFORCE
* add test_tracer on multiple devices
* test=develop, change place to mutable for data transform
* test=develop, add transform data on same place test and remove useless log
* test=develop, Add a TODO for data layout and a ut for conv2d with no bias
* Refine the codes related to fc op.
* Add GPU implementation for fc functor.
* Apply fc_fuse_pass in GPU inference.
test=develop
* Change the cmake for fc op.
* Change PADDLE_ENFORCE to PADDLE_ENFORCE_EQ.
* Add an attribute to set the activation type in fc_op.
* Enhance the unittest of fc_op.
test=develop
* Remove the declaration of FCOpGrad back to the header file.
test=develop
* Set default value for newly added arguments in test_fc_op.
test=develop
* Enhance fc_fuse_pass to enable fusing relu.
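A NumPy sketch of what the new activation attribute means for the fused op (a sketch, not the kernel):

```python
import numpy as np

def fc_with_activation(x, w, b, activation_type=""):
    """fc output with an optional fused activation; fc_fuse_pass sets
    activation_type to "relu" when it matches an fc + relu pattern."""
    out = x @ w + b
    if activation_type == "relu":
        out = np.maximum(out, 0.0)
    return out
```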
* Allow print the shapes of var_desc in graph.
test=develop
* Enhance fc_fuse_pass_tester.
* Remove the use of PADDLE_ENFORCE.
test=develop
* Correct the number of ops after fusing.
test=develop
* Fix a typo.
test=develop
* Set activation_type to null when there is no relu in fc.
test=develop
* Refine fc_fuse_pass's codes.
* Enable the set of shape for tensor.
* Refine repeated_fc_relu_pass and add unittest.
test=develop
* Open fuse all reduce op
test=develop
* Add Fuse optimization op log
* Add log in fuse_optimizer op pass and fuse all_reduce op pass
* replace with boost::optional<bool>
test=develop
* Polish code
test=develop
* fix code coverage
test=develop
TemporaryAllocator is a singleton used for allocating memory for cuDNN. Since it is a singleton, removing it gives better memory performance.
We replace TemporaryAllocator with CUDADeviceContextAllocator and CUDADeviceContextAllocation, which use a stream callback to free the memory allocated for the stream, avoiding the singleton.
Also added data_feed_proto to operator to fix CI for CPU compilation.
* Add an interface to enable cudnn for inference.
* Add cudnn_placement_pass.
test=develop
* Set the default value of cudnn_enabled_op_types to null.
test=develop
* Write the common basic class, placement_pass_base, to refine the codes.
test=develop
* Call EnableCUDNN in unittest.
test=develop
* Refine cudnn_placement_pass tester.
* Enable the testing of cudnn_placement_pass in inference's unittest.
test=develop
* Add the check of op kernels.
test=develop
* Support looking up embeddings from BoxPS.
* Add a _pull_box_sparse op; for now this op is not exposed to users.
* Add a BoxHelper class, providing 'BeginPass', 'EndPass', 'FeedPass' functions and so on.
* Add 'BoxPSDataset' in python code.
* Add a compile option WITH_BOX_PS and a MACRO PADDLE_WITH_BOX_PS.
* Add UT.
* For more concrete information, please refer to: https://github.com/PaddlePaddle/Paddle/pull/18982
- Refactor step 1
- Compilation fix
- Yet another compilation fix
- Even more compilation fix
- Lint fixes
test=develop
- Removed deprecated PADDLE_ENFORCE occurrence
test=develop
- Candidate fix to BN forward
- Lint fixes
test=develop
- Refactoring in data_layout_transform
- compilation fix
- Another compilation fix
- Step further into darkness
- Yet another compilation fix
- Yet another compilation fix
- missing header
- compilation fix
- Added MKLDNN -> Paddle conversion in fetch op
test=develop
- Compilation fix
test=develop
- Lint
test=develop
- Mul fix
- Fix to MKLDNN MUL op and Elementwise MUL UT
test=develop
- Workaround for different weights-with-groups representation in Paddle vs
MKL-DNN.
test=develop
- Candidate fix for 5D convolution with groups
- Refactor of fix for conv3d and conv2d in fetch op
test=develop
- Compilation fix
- Still same compilation fix
- Compilation fix
- Compilation fix
- Reverted refactoring of fixes
- Adapted test_conv2d_int8_mkldnn so it expects data in NCHW format,
not NHWC
test=develop
- minor fix in UT
test=develop
- Lint fixes
test=develop
* Add simplify_with_basic_ops_pass to replace dropout_op with scale_op when is_test is true.
test=develop
* Delete dropout_op directly when upscale_in_train is true.
test=develop
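The inference-time semantics the pass relies on, as a sketch:

```python
def dropout_at_inference(x, dropout_prob, upscale_in_train):
    """With upscale_in_train, dropout is an identity at test time, so the
    op can be deleted; otherwise it is exactly a scale by (1 - dropout_prob),
    so it can be replaced by a scale_op."""
    if upscale_in_train:
        return x
    return x * (1.0 - dropout_prob)
```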
* Improve the debug string by printing the op_desc information.
* Fix the case when dropout's input x is reused as the next op's output.
* Add the pass to inference.
test=develop
* Change the log level.
test=develop
* Add unittest for inplace case.
* Add comment to explain the pass.
* Apply the pass for CPU inference.
test=develop
* Fix the typo.
test=develop
* Add the check of AttrType.
test=develop
* fix correctness of the communicator
* fix a bug in send thread when sending var context is empty, test=develop
* add lookup_table_prefetch_op and prefetch optimize, test=develop
* remove remote prefetch GPU supported
* word2vec force with CPU, test=develop
* test dist remote lookup table force with CPU, test=develop
* add pybind interface to get all inplace ops, test=develop
* enhance OpTest to check the consistency of operator results when using and not using inplace, test=develop
* handle corner cases in op_test, test=develop
* support outputs without tensor holder_, like XShape in reshape_op, test=develop
* fix bug: some ops have GradOpMaker, but actually no grad_op in OpInfoMap, test=develop
* use reshape_grad instead of reshape in FlattenGradOp, test=develop
* fix erroneous debug dims info for variables like XShape, test=develop
* change the computational order in sum_op to relieve the computation difference introduced by inplace, test=develop
* add inplace_atol to check group_norm, and skip inplace_grad for mkldnn, test=develop
* follow sneaxiy's comments, test=develop
* remove unused DefaultGradOpDescMaker in mkldnn op, test=develop
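A toy sketch of the consistency-check idea (run_op and its inplace flag are hypothetical stand-ins for OpTest's machinery):

```python
import numpy as np

def check_inplace_consistency(run_op, x, inplace_atol=0.0):
    """Run an op with and without in-place buffer reuse and compare;
    inplace_atol relaxes the comparison for ops like group_norm."""
    out_plain = run_op(np.array(x, copy=True), inplace=False)
    out_inplace = run_op(x, inplace=True)  # may overwrite x's buffer
    assert np.allclose(out_plain, out_inplace, atol=inplace_atol)
```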