1. Since the allreduce op has four reduce types, we split it into four ops, one per reduce type.
2. We also refined the collective op code: we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter, since the target DeviceContext is already known.
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis.
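A minimal sketch of the split described in item 1, simulated in plain Python rather than with real collective kernels. The four reduce types are assumed here to be sum, product, max and min, and the function names are hypothetical labels for illustration, not the op names registered by this change.

```python
from functools import reduce
import operator


def _allreduce(rank_values, combine):
    """Combine element-wise across ranks; every rank receives the same result."""
    reduced = [reduce(combine, column) for column in zip(*rank_values)]
    return [list(reduced) for _ in rank_values]


# One function per reduce type, instead of one op carrying a reduce-type attribute.
def allreduce_sum(rank_values):
    return _allreduce(rank_values, operator.add)


def allreduce_prod(rank_values):
    return _allreduce(rank_values, operator.mul)


def allreduce_max(rank_values):
    return _allreduce(rank_values, max)


def allreduce_min(rank_values):
    return _allreduce(rank_values, min)


# Two simulated ranks, each holding a two-element buffer.
ranks = [[1, 5], [2, 3]]
print(allreduce_sum(ranks))
print(allreduce_max(ranks))
```

With one function per reduce type, the caller picks the reduction by choosing the op, so no reduce-type attribute has to be inspected at run time.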
* fix redundant code in prepare context, optimize executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unit tests for use_program_cache
test=develop
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* fix redundant code in prepare context, optimize executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unit tests for use_program_cache
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* fix comment
test=develop
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update to support python3.5
* test=develop change gpu memory use to 0.1 when testing
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dump in py35
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream
* test=develop update unittest sync operator I/O
1. fix the bug that out_put_var in SaveSelectedRows would be empty string
2. use merge_sparse_lookup_table to replace sum op for load_persistables_for_inference
3. fix the bug in _clone_var_in_block_ when the var is SELECTED_ROWS.
(1) use channel instead of vector/BlockingQueue in Dataset, to stay consistent with the existing implementation and to make the code more readable and flexible (a dataset can have a single output channel or multiple output channels). One previous memory-limit problem was caused by not releasing memory after training.
(2) add Record because MultiSlotType costs too much memory (80B); this fixes the memory-limit problem.
(3) add Channel, Archive in paddle/fluid/framework
(4) change dataset from shared_ptr to unique_ptr in pybind
(5) move create/destroy readers from trainer to dataset
(6) move shuffle from datafeed to dataset. dataset holds the memory; datafeed only loads data and feeds it to the network.
(7) fix thread num bug of Dataset when filelist size < thread num
(8) support set_queue_num in InMemoryDataset
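Item (7) concerns the case where there are fewer files than reader threads. One common way to handle this, sketched below as an assumption rather than as the fix actually applied here, is to cap the thread count at the number of files so that no thread is left without input. `effective_thread_num` and `assign_files` are hypothetical helper names.

```python
def effective_thread_num(filelist, thread_num):
    """Cap the thread count at the number of files, but never go below 1."""
    return max(1, min(thread_num, len(filelist)))


def assign_files(filelist, thread_num):
    """Distribute files round-robin over the capped number of threads."""
    n = effective_thread_num(filelist, thread_num)
    return [filelist[i::n] for i in range(n)]


# Four threads requested, but only two files are available.
print(assign_files(["part-0", "part-1"], 4))
```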
* test=develop, add add_multi_gpu_install_check
* test=develop, refine warning doc
* test=develop, refine warning doc
* test=develop, refine warning doc
* test=develop, support multi cpu
* test=develop, find right num of cuda device
* test=develop, find right num of cuda device
* test=develop, fix multigpu processing and fix type bug in dygraph
* test=develop, fix multigpu processing and fix type bug in dygraph
* Update backward.py:
- If there is no input grad var in all outputs of previous ops, do not append this op into graph.
- Only apply this strategy for double backward.
* Update some double backward op.
* Update sum_op to judge whether a tensor is empty by numel or IsInitialized().
* test=develop add target assign for retinanet
* test=develop
run ci
* test=develop
add test_layers
* test=develop
add API.spec
* test=develop
alter round 1
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter test_rpn_target_assign_op.py
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter API.spec
* test=develop
alter paddle/fluid/operators/detection/rpn_target_assign_op.cc
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter python/paddle/fluid/layers/detection.py
* test=develop
alter paddle/fluid/API.spec
* Remove layers.detection_map API
* Users can use fluid.metrics.DetectionMAP to calculate both current-batch and cumulative-batch mAP, whereas layers.detection_map can only calculate current-batch mAP.
* test=develop
The scatter op has a calculation bug when the indices contain duplicate entries: it uses overwrite mode for the repeated index. Fix this by using accumulate mode for repeated indices. The gather op has the same bug when computing its gradient. We also use the OpenBLAS and Eigen libraries to reduce the time cost of accumulate mode.
* test=develop
Fix some code format problems and add test cases for the gather and scatter ops
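The difference between the two modes for duplicate indices can be sketched in plain Python. This is an illustration on flat lists, not the kernel code; the `overwrite` flag name and the choice to clear the target slots before accumulating are assumptions of this sketch.

```python
def scatter(dst, indices, updates, overwrite=True):
    """Write `updates` into a copy of `dst` at `indices`."""
    out = list(dst)
    if overwrite:
        # Overwrite mode: for a repeated index, the last update wins.
        for i, u in zip(indices, updates):
            out[i] = u
    else:
        # Accumulate mode: clear each target slot once, then sum every update
        # that points at it (clearing first is an assumption of this sketch).
        for i in set(indices):
            out[i] = 0
        for i, u in zip(indices, updates):
            out[i] += u
    return out


dst = [0, 0, 0]
print(scatter(dst, [1, 1, 2], [10, 20, 30], overwrite=True))
print(scatter(dst, [1, 1, 2], [10, 20, 30], overwrite=False))
```

The same reasoning applies to the gather gradient: when one input row is gathered more than once, its gradient is the sum of all the corresponding output gradients, so overwriting would drop contributions.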
* Cherry-pick fix random Python3 CI failure.
In some tests, SWEs wrote "print('xxx').format('xxx')". This
only works in Python 2, not Python 3. However, since those
lines are related to data download, the tests pass on CI machines that
already have the data, which makes the failure appear random.
* Cherry-pick: disable CUDNN case of test_warpctc_op
Also temporarily disable a unit test. The test will be fixed with high priority.
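The Python 3 failure described in the first cherry-pick can be reproduced with a short sketch. In Python 2, `print` is a statement, so `.format` binds to the string literal; in Python 3, `print()` is a function that returns `None`, and calling `.format` on `None` raises `AttributeError`. The function names and the message text below are hypothetical.

```python
def broken_message(name):
    # Python 3: print() returns None, so .format is looked up on None
    # and raises AttributeError after the unformatted text is printed.
    return print("downloading {}").format(name)


def fixed_message(name):
    # Format the string first, then print it. Works in Python 2 and 3.
    text = "downloading {}".format(name)
    print(text)
    return text


try:
    broken_message("mnist")
except AttributeError as err:
    print("broken_message failed:", err)

fixed_message("mnist")
```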
* add deformable psroi pooling
* test=develop
* test=develop
* test=develop
modify format
* fix bug
* test=develop run ci
* test=develop
add API.spec
* add test_layers.py
* run ci again
* test=develop
run ci again
* run ci again
* test=develop
run ci again
* test=develop
run ci again
* test=develop
run ci again
* add space between two lines
* test=develop
add space between two lines
* test=develop
add space between lines
* test=develop
modify comment in nn.py
* test=develop
add space between two lines
* test=develop
add space between two lines
* update API.spec
* run ci again
* test=develop
run ci again
* rerun ci
* test=develop
rerun ci
* change input shape
* run ci
* test=develop
run ci
* modify format of nn.py
* test=develop
* test=develop
* test=develop
update API.spec
* test=develop
fix API doc
* modify API comment
* modify API comment
* test=develop
update API.spec
* test=develop
modify comment
* test=develop
modify comment
* test=develop
modify comment
* test=develop
update API.spec
* test=develop
modify comment
* test=develop
add inference in nn.py
* test=develop
update API.spec
* test=develop
resolve conflict
* test=develop
update API.spec
* add unfold op
test=develop
* fix divide bug in python3 when calculating output width and height
test=develop
* add name=None in python api, move redundant code into inline function
* try to trigger ci for this code
test=develop
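The Python 3 divide issue mentioned above typically arises because `/` returns a float in Python 3, while an output shape needs an integer. The sketch below uses the standard sliding-window output-size formula; that the unfold op uses exactly this formula, and the parameter names, are assumptions.

```python
def unfold_output_size(input_size, kernel_size, stride, pad_begin, pad_end, dilation=1):
    """Number of sliding-window positions along one spatial dimension."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    # Floor division keeps the result an int in both Python 2 and Python 3.
    return (input_size + pad_begin + pad_end - effective_kernel) // stride + 1


# With "/" instead of "//", Python 3 would produce 4.5 for this case.
out_h = unfold_output_size(8, kernel_size=3, stride=2, pad_begin=1, pad_end=1)
print(out_h)
```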
* add 'UserDefinedRoleMakerNCCL' for collective mode.
* code style
* add the name UserDefinedRoleMakerNCCL to __all__
* rename to UserDefinedRoleMakerCollective
* rename to UserDefinedCollectiveRoleMaker
Add Pipeline Concurrency Train Mode:
- Cpp: pipeline_trainer & section_worker
- Python: PipelineOptimizer
- Add a new data_feed type: PrivateInstantDataFeed
- Add a test demo of pipeline trainer and the test model is gnn
- Do not support win32 now
* Enable seq_pool op to accept len 0 input
test=develop
* Update sequence_pool's api
test=develop
* Add more unittest cases for seq_pool op
test=develop
* Remove legacy comments
test=develop
* Don't use template in op maker
test=develop
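A minimal sketch of what accepting a length-0 input means for sequence pooling, written over a flat list with LoD-style offsets. Returning a fill value for an empty sequence, and the `pad_value` parameter name, are assumptions of this sketch; only sum pooling is shown.

```python
def sequence_sum_pool(values, offsets, pad_value=0.0):
    """Sum-pool each sequence; sequence i spans values[offsets[i]:offsets[i + 1]]."""
    pooled = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        if start == end:
            # Empty sequence: nothing to pool, so emit the fill value.
            pooled.append(pad_value)
        else:
            pooled.append(sum(values[start:end]))
    return pooled


# Three sequences of lengths 2, 0 and 1.
print(sequence_sum_pool([1.0, 2.0, 3.0], [0, 2, 2, 3], pad_value=-1.0))
```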
* test=develop, refine api
* test=develop, fix bug when error occurred on save_persistable with no optimizer
* test=develop, refine warning
* test=develop, refine example code and comments
* save optimizer related vars in dygraph
* test=develop, add optimizer save and load
* test=develop, add optimizer save and load
* test=develop, merge code and add multi-optimizer save and load
* test=develop, fix test_imperative_checkpoint
* test=develop, fix include error
* test=develop, fix include error
* test=develop, renew api spec
* test=develop, refine code
* test=develop, set default value for checkpoint
* test=develop, fix ci error
* test=develop, change API.spec and make api more readable
* test=develop, refine version and time stamp
* test=develop, add example code and refine code
* test=develop, refine doc
* test=develop, change version
* for debug
* test=develop, memory optimize for dygraph using shared_ptr
* test=develop, fix travis ci showed error
* test=develop, fix bug for recurrent usage of varbase
* test=develop, init varbase when it needs to be Add
* test=develop, fix problem of recurrent gradient
* test=develop, add gradient test for recurrent varbase usage