* refactor dygraph,test=develop
* fix failed unittest,test=develop
* polish code,test=develop
* check windows ci error,test=develop
try to fix windows ci error by np.allclose,test=develop
* polish vlog and profiler, test=develop
* try to fix preceding ops order,test=develop
* test transformer in windows ci, test=develop
* use python c-api to speed up tracer.trace,test=develop
* test=develop, fix docker with paddle nccl problem
* test=develop, add ut for debug string and gradient_accumulator
* test=develop, add tests for layer/gradient_accumulator/prepared_op
* test=develop, fix compile error for test_prepared_op
* test=develop, add more ut for dygraph
* test=develop, create API.spec for dygraph api change
* test=develop, refactor names to make them easier to understand
* test=develop, fix multi-GPU failure, add Tracer tests, change PADDLE_ENFORCE to PADDLE_ENFORCE_EQ
* test=develop, fix ut failure on parallel se-resnext
* test=develop, change one more PADDLE_ENFORCE
* support auto prune in dygraph mode
* test=develop, support auto prune
* test=develop, merge develop conflict
* test=develop, fix test_layer and test_tracer ut
* test=develop, fix a bug that could disable stop_gradient when given a list of backward inputs
* Open fuse all reduce op
test=develop
* Add Fuse optimization op log
* Add log in fuse_optimizer op pass and fuse all_reduce op pass
* replace with boost::optional<bool>
test=develop
* Polish code
test=develop
* fix code coverage
test=develop
* Support looking up embeddings from BoxPS.
* Add a _pull_box_sparse op, for now this op is not exposed to users.
* Add a BoxHelper class, providing 'BeginPass', 'EndPass', 'FeedPass' functions and so on.
* Add 'BoxPSDataset' in python code.
* Add a compile option WITH_BOX_PS and a macro PADDLE_WITH_BOX_PS.
* Add UT.
* For more details, please refer to: https://github.com/PaddlePaddle/Paddle/pull/18982
* add pybind interface to get all inplace ops, test=develop
* enhance OpTest to check that an operator gives consistent results with and without inplace, test=develop
* handle corner cases in op_test, test=develop
* support outputs without tensor holder_, like XShape in reshape_op, test=develop
* fix bug: some ops have a GradOpMaker but no actual grad_op in OpInfoMap, test=develop
* use reshape_grad instead of reshape in FlattenGradOp, test=develop
* fix error debug dims info for variables like XShape, test=develop
* change the computation order in sum_op to reduce the numerical difference when using inplace, test=develop
* add inplace_atol to check group_norm, and skip inplace_grad for mkldnn, test=develop
* follow sneaxiy's comments, test=develop
* remove unused DefaultGradOpDescMaker in mkldnn op, test=develop
* fix warpctc.dll not found issue, test=develop
* revert the linux platform change, test=develop
* delete warpctc_lib_path.h.in, test=develop
* add SetPySitePackagePath function
* fix warpctc.dylib not found issue on Mac, test=develop
* improve the paddle lib path setting logic, test=develop
* fix mac ci issue caused by test_warpctc_op unittest, test=develop
* tweak code, test=develop
* Fix Mask rcnn predictor
1. refine memory optim algorithm to support the model with the block op.
2. output diff : modify the affine channel fuse
3. add condition_block_infer op
add interface for setting trt calib table dir
test=develop
* add the missing files.
test=develop
* 1. add trt fp16 support
test=develop
(1) support patch data (merge slots of instances with the same line id, modify the dense layer that changes its size)
(2) add fleet load_one_table interface, support loading from a paddle model and from a pslib model
(3) fix a push sparse bug that made push sparse take more time (about 10% in my test case)
(4) when some slots are not in one of your networks (join/update, etc.), data feed, collect label info, and push/pull sparse skip these slots instead of throwing an error
(5) add more debug info in TrainFilesWithProfiler
* feature/auto_growth_allocator, test=develop
* add unittest of AlignedAllocator, test=develop
* try to turn on auto_growth to test on CI, test=develop
* fix segmentation fault in mixed_vector.h, test=develop
* add unittests, test=develop
1. Since the allreduce op has four reduce types, we split them into four separate ops.
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter, since the target DeviceContext is already known.
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis.
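As a rough illustration of item 1, the sketch below simulates an allreduce in plain Python with one function per reduce type instead of a single op carrying a reduce-type attribute. The four types shown (sum, prod, max, min) and all function names are assumptions made for illustration; the commit does not list them, and this is not the Paddle implementation.

```python
from functools import reduce
import operator


def _allreduce(rank_tensors, combine):
    """Reduce element-wise across ranks and give every rank the result."""
    reduced = [reduce(combine, column) for column in zip(*rank_tensors)]
    # Every rank receives its own copy of the reduced tensor.
    return [list(reduced) for _ in rank_tensors]


# One entry point per reduce type (hypothetical names).
def allreduce_sum(rank_tensors):
    return _allreduce(rank_tensors, operator.add)


def allreduce_prod(rank_tensors):
    return _allreduce(rank_tensors, operator.mul)


def allreduce_max(rank_tensors):
    return _allreduce(rank_tensors, max)


def allreduce_min(rank_tensors):
    return _allreduce(rank_tensors, min)


ranks = [[1, 5], [2, 3], [4, 0]]  # three ranks, two elements each
sum_out = allreduce_sum(ranks)
max_out = allreduce_max(ranks)
```

Splitting by type means each function (or op) has a fixed behavior, so nothing downstream needs to inspect an attribute to know which reduction runs.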
* fix prepare context redundant code problem, optimize executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unittests for use_program_cache
test=develop
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update to support python3.5
* test=develop change gpu memory use to 0.1 when test
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dump on py35
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream
* test=develop update unittest sync operator I/O
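The tolerance update above (like the earlier switch to np.allclose) relies on the usual element-wise criterion |a - b| <= atol + rtol * |b|, which is what NumPy documents for np.allclose. The pure-Python sketch below mirrors that criterion so it runs without NumPy; the helper name and sample values are illustrative only.

```python
def allclose(actual, expected, rtol=1e-05, atol=1e-05):
    """Element-wise |a - b| <= atol + rtol * |b| over two equal-length sequences."""
    return len(actual) == len(expected) and all(
        abs(a - b) <= atol + rtol * abs(b) for a, b in zip(actual, expected)
    )


# Results that differ only by floating-point rounding, e.g. across platforms.
reference = [0.1 + 0.2, 1.0 / 3.0]
other = [0.3, 0.3333333]

exact_match = reference == other  # bit-for-bit comparison
tolerant_match = allclose(other, reference)
```

An exact comparison fails on rounding noise alone, while the tolerant check still rejects differences well above 1e-05.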
(1) use channel instead of vector/BlockingQueue in Dataset, to stay consistent with the existing implementation and make the code more readable and flexible (a dataset can have a single output channel or multiple output channels). One earlier out-of-memory problem was caused by not releasing memory after training.
(2) add Record because MultiSlotType costs too much memory (80B); this fixes the out-of-memory problem.
(3) add Channel and Archive in paddle/fluid/framework
(4) change dataset from shared_ptr to unique_ptr in pybind
(5) move create/destroy readers from trainer to dataset
(6) move shuffle from datafeed to dataset. The dataset holds memory; datafeed only loads data and feeds it to the network.
(7) fix the thread num bug in Dataset when filelist size < thread num
(8) support set_queue_num in InMemoryDataset
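Item (7) concerns the case where there are fewer files than reader threads. The commit does not say how it was fixed; one common approach, sketched here with hypothetical names, is to clamp the thread count to the number of files before partitioning the file list.

```python
def assign_files(filelist, thread_num):
    """Round-robin files over threads, never using more threads than files."""
    if thread_num < 1:
        raise ValueError("thread_num must be at least 1")
    # Assumed fix: clamp so no thread is left without a file
    # (an empty file list still yields one idle shard).
    effective_threads = max(1, min(thread_num, len(filelist)))
    shards = [[] for _ in range(effective_threads)]
    for index, path in enumerate(filelist):
        shards[index % effective_threads].append(path)
    return shards


shards = assign_files(["part-0", "part-1"], thread_num=4)
```

With two files and four requested threads, only two shards are created, so no reader starts with an empty work list.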
* for debug
* test=develop, memory optimize for dygraph using shared_ptr
* test=develop, fix travis ci showed error
* test=develop, fix bug for recurrent usage of varbase
* test=develop, init varbase when it needs to be added
* fix prepare context redundant code problem, optimize executor by caching create_variables
test=develop
* cache sub_scope, program, var when use_program_cache=True is set
* make fetch_list runnable with variables, add more unittests for use_program_cache
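The caching described above can be pictured as a small memo table: the expensive preparation step runs once per program, and later runs with use_program_cache=True reuse the result. The class, its cache key, and the call counter below are hypothetical stand-ins, not the fluid Executor's internals.

```python
class CachingRunner:
    """Toy runner that memoizes per-program preparation work."""

    def __init__(self):
        self._cache = {}
        self.prepare_calls = 0  # counts how often the expensive step runs

    def _prepare(self, program, fetch_names):
        self.prepare_calls += 1
        # Stand-in for creating the context, sub-scope and variables.
        return {"program": program, "fetch": tuple(fetch_names), "scope": {}}

    def run(self, program, fetch_names, use_program_cache=False):
        key = (id(program), tuple(fetch_names))
        if use_program_cache and key in self._cache:
            prepared = self._cache[key]
        else:
            prepared = self._prepare(program, fetch_names)
            if use_program_cache:
                self._cache[key] = prepared
        return prepared


runner = CachingRunner()
program = object()  # placeholder for a real program
for _ in range(3):
    runner.run(program, ["loss"], use_program_cache=True)
cached_calls = runner.prepare_calls

for _ in range(3):
    runner.run(program, ["loss"], use_program_cache=False)
uncached_calls = runner.prepare_calls - cached_calls
```

Three cached runs trigger the preparation step once, while three uncached runs trigger it every time.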
* fluid int8 train and trt int8 predict align.
trt int8 predict init
op converter
* 2. align fluid int8 train and trt int8 inference.
enhance quant dequant fuse pass
enhance op converter, trt engine, trt engine op, trt subgraph pass.
* 3. add delete_quant_dequant_pass for trt
test=develop
* 4. add the missing file
test=develop
* 5. I modified the C++ interface but forgot to modify the pybind code
fix the IS_TRT_VERSION_GE bug, and fix elementwise op converter
test=develop
* Add conv2d_grad_grad_op
* Extract the cuDNN conv algo searching code into conv_cudnn_helper.h.
- Now use it in conv2d_grad_grad.
- Will simplify the searching code in conv2d and conv2d_grad in the next PR.
* Enhance and fix a bug in the unit testing of gradient_checker.
* Support fetching empty variables, returning None in Python.
Fix the following API examples:
paddle.fluid.scope_guard
paddle.fluid.backward.append_backward
paddle.fluid.cpu_places
paddle.fluid.cuda_pinned_places
paddle.fluid.cuda_places
paddle.fluid.in_dygraph_mode
paddle.fluid.CUDAPlace
paddle.fluid.CPUPlace
paddle.fluid.CUDAPinnedPlace
* speedup gc and inplace softmax_with_cross_entropy_grad
test=develop
* refine models gpu mem
Merge skip vars and warning messages of mem opt
remove relu mem opt
test=develop
* follow comments
test=develop
* Support Sync Batch Norm.
* Note: do not enable it when running on a single device.
Usage:
    build_strategy = fluid.BuildStrategy()
    build_strategy.sync_batch_norm = True
    binary = fluid.compiler.CompiledProgram(tp).with_data_parallel(
        loss_name=loss_mean.name,
        build_strategy=build_strategy)
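To see what the flag changes, the sketch below contrasts per-device batch statistics with statistics pooled across devices, which is the general idea behind synchronized batch norm. It is plain Python with made-up data; how Paddle actually exchanges the statistics is not shown here. With a single device the two results coincide, which is presumably why the note advises against enabling it there.

```python
def mean_var(values):
    """Biased mean and variance of a flat list of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def synced_mean_var(per_device_batches):
    """Pool count, sum and sum of squares over all devices, then normalize."""
    count = sum(len(batch) for batch in per_device_batches)
    total = sum(sum(batch) for batch in per_device_batches)
    total_sq = sum(sum(v * v for v in batch) for batch in per_device_batches)
    mean = total / count
    return mean, total_sq / count - mean * mean


device_batches = [[1.0, 3.0], [5.0, 7.0]]  # two devices, two samples each
local_stats = [mean_var(batch) for batch in device_batches]
global_stats = synced_mean_var(device_batches)
```

Each device alone sees a small variance around its own mean, while the pooled statistics reflect the spread of the whole batch.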
* Remove Desc in Forward Pass
* Refactor VarBase
* Add dbg info
* Only check type in imperative mode
* Polish code and support optimizer
test=develop
* Fix stop gradient problem in PyLayer
test=develop
* Remove some superfluous std::move calls
The std::move triggered a build error (with -Werror):
```
[ 9%] Building CXX object paddle/fluid/memory/allocation/CMakeFiles/allocator_facade.dir/allocator_facade.cc.o
/home/tej/code/gbuella_paddle/paddle/fluid/memory/allocation/allocator_facade.cc:86:29: error: moving a temporary object prevents copy elision [-Werror,-Wpessimizing-move]
[this] { return std::move(CreateAllocatorWithChunk()); }, capacity);
^
/home/tej/code/gbuella_paddle/paddle/fluid/memory/allocation/allocator_facade.cc:86:29: note: remove std::move call here
[this] { return std::move(CreateAllocatorWithChunk()); }, capacity);
^~~~~~~~~~ ~
1 error generated.
```
See: https://reviews.llvm.org/D7633
* Remove a superfluous lambda capture from framework/operator.h
```
[ 10%] Building CXX object paddle/fluid/platform/CMakeFiles/device_context.dir/init.cc.o
In file included from /home/tej/code/gbuella_paddle/paddle/fluid/platform/init.cc:19:
/home/tej/code/gbuella_paddle/paddle/fluid/framework/operator.h:229:21: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
[this](Variable* var) { return var; });
^~~~
1 error generated.
```
Changed it to `return it->second;`, as in the function below.
* Rethrow an exception (instead of copying it)
```
[ 11%] Building CXX object paddle/fluid/framework/CMakeFiles/operator.dir/operator.cc.o
/home/tej/code/gbuella_paddle/paddle/fluid/framework/operator.cc:191:13: error: local variable 'exception' will be copied despite being thrown by name [-Werror,-Wreturn-std-move]
throw exception;
^~~~~~~~~
/home/tej/code/gbuella_paddle/paddle/fluid/framework/operator.cc:191:13: note: call 'std::move' explicitly to avoid copying
throw exception;
^~~~~~~~~
std::move(exception)
```
See https://reviews.llvm.org/D43322 for an explanation of this diagnostic message.
* Remove an unused variable
```
/home/tej/code/gbuella_paddle/paddle/fluid/framework/operator.cc:884:16: error: private field 'scope_' is not used [-Werror,-Wunused-private-field]
const Scope& scope_;
^
```
* struct ComputationOpHandle -> class ComputationOpHandle
```
[ 13%] Building CXX object paddle/fluid/framework/details/CMakeFiles/memory_early_delete_pass.dir/memory_early_delete_pass.cc.o
In file included from /home/tej/code/gbuella_paddle/paddle/fluid/framework/details/memory_early_delete_pass.cc:21:
/home/tej/code/gbuella_paddle/paddle/fluid/framework/details/reference_count_pass_helper.h:30:1: error: class 'ComputationOpHandle' was previously declared as a struct; this is valid, but may result in linker errors under the Microsoft C++ ABI [-Werror,-Wmismatched-tags]
class ComputationOpHandle;
^
/home/tej/code/gbuella_paddle/paddle/fluid/framework/details/computation_op_handle.h:29:8: note: previous use is here
struct ComputationOpHandle : public OpHandleBase {
^
/home/tej/code/gbuella_paddle/paddle/fluid/framework/details/reference_count_pass_helper.h:30:1: note: did you mean struct here?
class ComputationOpHandle;
^~~~~
struct
1 error generated.
```
* Fix name() methods under fluid/operators
```
In file included from /home/tej/code/gbuella_paddle/paddle/fluid/operators/jit/gen/act.cc:15:
In file included from /home/tej/code/gbuella_paddle/paddle/fluid/operators/jit/gen/act.h:19:
/home/tej/code/gbuella_paddle/paddle/fluid/operators/jit/gen/jitcode.h:71:23: error: 'name' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
virtual const char* name() const = 0;
^
/home/tej/code/gbuella_paddle/paddle/fluid/operators/jit/gen_base.h:31:23: note: overridden virtual function is here
virtual const char* name() const = 0;
^
```
test=develop