* Open fuse all reduce op
test=develop
* Add Fuse optimization op log
* Add log in fuse_optimizer op pass and fuse all_reduce op pass
* replace with boost::optional<bool>
test=develop
* Polish code
test=develop
* fix code coverage
test=develop
TemporaryAllocator is a singleton used for allocating memory for Cudnn. Since it is a singleton, we can delete it for better performance in memory.
We replace TemporaryAllocator by CUDADeviceContextAllocator and CUDADeviceContextAllocation, which uses stream callback to delete the memory allocated for the stream to avoid singleton.
Also added data_feed_proto to operator to fix CI in CPU compilation
* Refine the codes related to fc op.
* Add GPU implementation for fc functor.
* Apply fc_fuse_pass in GPU inference.
test=develop
* Change the cmake for fc op.
* Change PADDLE_ENFORCE to PADDLE_ENFORCE_EQ.
* Add an attribute to set the activation type in fc_op.
* Enhance the unittest of fc_op.
test=develop
* Remove the declaration of FCOpGrad back to the header file.
test=develop
* Set default value for newly added arguments in test_fc_op.
test=develop
* Add a interface to enable cudnn for inference.
* Add cudnn_placement_pass.
test=develop
* Set the default value of cudnn_enabled_op_types to null.
test=develop
* Write the common basic class, placement_pass_base, to refine the codes.
test=develop
* Call EnableCUDNN in unittest.
test=develop
* Refine cudnn_placement_pass tester.
* Enable the testing of cudnn_placement_pass in inference's unittest.
test=develop
* Add the check of op kernels.
test=develop
* Support looking up embeddings from BoxPS.
* Add a _pull_box_sparse op, for now this op is not exposed to users.
* Add a BoxHelper class, providing 'BeginPass', 'EndPass', 'FeedPass' functions and so on.
* Add 'BoxPSDataset' in python code.
* Add a compile options WITH_BOX_PS and a MACRO PADDLE_WITH_BOX_PS.
* Add UT.
* More concrete information pls refer to: https://github.com/PaddlePaddle/Paddle/pull/18982
- Refactor step 1
- Compilation fix
- Yet another compilation fix
- Even more compilation fix
- Lint fixes
test=develop
- Removed deprectaed PADDLE_ENFORCE occurance
test=develop
- Candidate fix to BN forward
- Lint fixes
test=develop
- Refactoring in data_layout_transform
- compilation fix
- Another comppilation fix
- Step further into darkness
- Yet another compilation fix
- Yet another compilation fix
- missing header
- compilation fix
- Added MKLDNN -> Paddle conversion in fetch op
test=develop
- Compilation fix
test=develop
- Lint
test=develop
- Mul fix
- Fix to MKLDNN MUL op and Elementwise MUL UT
test=develop
- Workaround for diffrent weights with groups representation Paddle vs
MKL-DNN.
test=develop
- Candidate fix for 5D convolution with groups
- Refactor of fix for conv3d and conv2d in fetch op
test=develop
- Compilation fix
- Still same compilation fix
- Compilation fix
- Compilation fix
- Reverted refactoring of fixes
- Adapted test_conv2d_int8_mkldnn so it exects data in NCHW format
not NHWC
test=develop
- minor fix in UT
test=develop
- Lint fixes
test=develop
* Add simplify_with_basic_ops_pass to replace dropout_op with scale_op when is_test is true.
test=develop
* Delete dropout_op directly when upscale_in_train is true.
test=develop
* Improve the debug string, adding the print of op_desc information.
* Fix the case when dropout's input x is reused as the next op's output.
* Add the pass to inference.
test=develop
* Change the log level.
test=develop
* Add unittest for inplace case.
* Add comment to explain the pass.
* Apply the pass for CPU inference.
test=develop
* Fix the typo.
test=develop
* Add the check of AttrType.
test=develop
* fix correctness of the communicator
* fix a bug in send thread when sending var context is empty, test=develop
* add lookup_table_prefetch_op and prefetch optimize, test=develop
* remove remote prefetch GPU supported
* word2vec force with CPU, test=develop
* test dist remote lookup table force with CPU, test=develop
* add pybind interface to get all inplace ops, test=develop
* enhance OpTest to check whether the consistency of operator when using and not using inplace, test=develop
* handle corner cases in op_test, test=develop
* support outputs without tensor holder_, like XShape in reshape_op, test=develop
* fix bug, some op has GradOpMaker, but actually no grad_op in OpInfoMap, test=develop
* use reshape_grad instead of reshape in FlattenGradOp, test=develop
* fix error debug dims info for variables like XShape, test=develop
* change computational order in sum_op to relieve computation difference using inplace, test=develop
* add inplace_atol to check group_norm, and skip inplace_grad for mkldnn, test=develop
* follow sneaxiy's comments, test=develop
* remove unused DefaultGradOpDescMaker in mkldnn op, test=develop
* Enhance the error message when GrapOpMaker is null.
test=develop
* Call Proto() instead of directly using proto_ pointer.
test=develop
* Rollback to use proto_ directly, because some sepecial grad ops, such some double grad ops, donot have proto.
test=develop
* add a place field in DataFeed to denote which place it will feed data to.
* abstract the copy process in CopyToFeedTensor function
* add UT for float32 type and for CUDAPlace
* Add call stack info during runtime and compile time
test=develop
* Rename operator_call_stack
test=develop
* Add unit test
test=develop
* follow comment
test=develop
* fix QueueDataset queue size,set queue size = batch size * 100, to avoid too many instances in channel when training is much slower than reading data.
* open gc by default, test=develop
* fix test_train_recognize_digits and disable gc when ngraph is enabled, test=develop
* fix conditional_block op eager deletion bug, test=develop
* add some comments to reviewers, test=develop
* support filelist size < trainer num
* pull dense when stop, to make sure local dense params are same as pserver, so save paddle model will save dense model same as pserver
* enable QueueDataset train same filelist for serveral times
* update paddle-trt for:
1. fix bug: when batch > 2, core in split plugin.
2. add leaky_relu trt5.0 support (yolov3 from 65ms to 42ms.)
3. add new attr to dropout.
4. shuffle channel, swish, relu6 support
test=develop
* 1. fix ci
test=develop
The change includes 2 things:
1. save delta model and shrink table are control by the same parameter before, now add delete_after_unseen_days to control shrink table.
2. value in sparse table has no slot before, now add slot in sparse table, and add DownpureCtrAccessor to support the new meta.
test=develop
(1)support patch data (merge slots of instances of same line id, modify dense layer which
changes its size)
(2)add fleet load_one_table interface, support load from paddle model and load from pslib model
(3)fix push sparse bug which cause push sparse cost more time(about 10% in my testcase)
(4)when some slots are not in one of your network (join/update, etc.),data feed、collect label info、push/pull sparse will skip these slots, instead of throw error.
(5)add more debug info in TrainFilesWithProfiler
Test PaddingRNN on V100 GPU device.
Test configuration: large model, padding mode (which is the mode using recurrentOp), one GPU.
GPU memory (MiB): 6414 (this PR) vs 6837 (without this PR)
Speed (steps/s): 10.28 (this PR) vs 9.89 (without this PR)
* feature/auto_growth_allocator, test=develop
* add unittest of AlignedAllocator, test=develop
* try to turn on auto_growth to test on CI, test=develop
* fix segmentation fault in mixed_vector.h, test=develop
* add unittests, test=develop
* not use transferscope cache in cpu case
test=develop
* adjust variable name and add comments
test=develop
* use correct format for class member in operator.h
* use correct format for class member in operator.cc
test=develop
* Fix Mask rcnn predictor
1. refine memory optim algorithm to support the model with the block op.
2. output diff : modify the affine channel fuse
3. add condition_block_infer op
add interface for setting trt calib table dir
test=develop
* add the missing files.
test=develop
1. Since allreduce op has 4 reduce types, We split these four reduce types into four ops
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and remove the device specified DeviceContext parameter in template as we already knew the target DeviceContext
3. We remove the newly added Collective op role to reduce the complexity of program and graph analysis
* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop
* supports collective training in executor
* make fetch_list runable with variables, add more unittest for use_program_cache
test=develop
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plugin more strategy
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop
* supports collective training in executor
* make fetch_list runable with variables, add more unittest for use_program_cache
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* fix comment
test=develop
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plugin more strategy
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* marcro update
* test=develop
update support python3.5
* test=develop change gpu memory use to 0.1 when test
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dumple py35
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream
* test=develop update unittest sync operator I/O