1. Since the allreduce op has four reduce types, we split it into four separate ops, one per reduce type (see the sketch after this list).
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter, since the target DeviceContext is already known.
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis.
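A minimal sketch of the four reduce semantics the split ops implement, simulated over per-rank NumPy arrays (the four NCCL reduce types; op naming in this PR is not shown here):

```python
# Simulate allreduce across two "ranks" for each reduce type;
# every rank would receive the same reduced tensor.
import numpy as np

rank_tensors = [np.array([1., 5., -2.]), np.array([3., 2., -4.])]

reduced = {
    'sum':  np.sum(rank_tensors, axis=0),   # [ 4.,  7., -6.]
    'prod': np.prod(rank_tensors, axis=0),  # [ 3., 10.,  8.]
    'max':  np.max(rank_tensors, axis=0),   # [ 3.,  5., -2.]
    'min':  np.min(rank_tensors, axis=0),   # [ 1.,  2., -4.]
}
```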
* fix redundant code problem in prepare context, optimize executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unittests for use_program_cache
test=develop
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update to support Python 3.5
* test=develop change GPU memory usage to 0.1 in tests
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dump on py35
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify input reading: read from memory instead of from file
remove origin_program in framework and add i/o in c_sync_calc_stream
* test=develop update unittest sync operator I/O
* Update backward.py:
- If none of this op's input grad vars appears among the outputs of previous ops, do not append this op to the graph.
- Only apply this strategy for double backward (a sketch follows below).
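A schematic sketch of that pruning rule (hypothetical helper and attribute names, not the actual backward.py code):

```python
# `op.input_names` / `op.output_names` are stand-ins for however the
# real graph exposes an op's input and output var names.
def prune_double_backward_ops(candidate_ops, produced_grad_vars):
    kept = []
    for op in candidate_ops:
        # Skip an op if none of its inputs is a grad var produced
        # by the ops appended so far.
        if not any(v in produced_grad_vars for v in op.input_names):
            continue
        kept.append(op)
        produced_grad_vars.update(op.output_names)
    return kept
```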
* Update some double backward op.
* Update sum_op to judge whether a tensor is empty by numel or IsInitialized().
* test=develop add target assign for retinanet
* test=develop
run ci
* test=develop
add test_layers
* test=develop
add API.spec
* test=develop
alter round 1
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter test_rpn_target_assign_op.py
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter API.spec
* test=develop
alter paddle/fluid/operators/detection/rpn_target_assign_op.cc
* test=develop
alter rpn_target_assign_op.cc
* test=develop
alter python/paddle/fluid/layers/detection.py
* test=develop
alter paddle/fluid/API.spec
* Remove layers.detection_map API
* Users can use fluid.metrics.DetectionMAP to calculate mAP for both the current batch and cumulative batches, whereas layers.detection_map can only calculate current-batch mAP.
* test=develop
The scatter op has a calculation bug when the indices contain duplicates: it uses overwrite mode, so later updates to the same index overwrite earlier ones. Fix this bug by using accumulate mode for duplicated indices. The gather op has the same bug when computing its gradient. We also use the OpenBLAS and Eigen libraries to optimize the time cost of accumulate mode.
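A minimal NumPy sketch (not the Paddle kernel) of the two scatter semantics on duplicated indices:

```python
import numpy as np

x = np.zeros(3)
index = np.array([1, 1])        # index 1 appears twice
updates = np.array([2.0, 3.0])

# Overwrite mode: the last update wins, so x[1] ends up 3.0.
x_overwrite = x.copy()
x_overwrite[index] = updates

# Accumulate mode: duplicated updates are summed, so x[1] ends up 5.0.
# This is the semantics the gather op's gradient requires.
x_accumulate = x.copy()
np.add.at(x_accumulate, index, updates)
```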
* test=develop
Fix some code format problems, and at the same time add test cases for the gather and scatter ops
* Cherry-pick fix for random Python 3 CI failure.
In some tests, "print('xxx').format('xxx')" was used. That syntax
is only valid in Python 2, not Python 3. However, since those
lines are related to data download, the tests pass if the CI
machines already have the data. That causes random failures.
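A minimal repro of the broken pattern (illustrative string, not the actual test code):

```python
# In Python 2, `print` is a statement, so the broken line parses as
#   print (('downloading {}').format('data.tar.gz'))
# and works. In Python 3, print('downloading {}') is a call that
# returns None, so .format(...) raises
#   AttributeError: 'NoneType' object has no attribute 'format'
# print('downloading {}').format('data.tar.gz')  # fails on Python 3

# The form that works on both:
print('downloading {}'.format('data.tar.gz'))
```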
* Cherry-pick: disable CUDNN case of test_warpctc_op
Also temporarily disable a unit test. The test will be fixed with high priority.
* add deformable psroi pooling
* test=develop
modify format
* fix bug
* test=develop run ci
* test=develop
add API.spec
* add test_layers.py
* run ci again
* test=develop
run ci again
* add space between two lines
* test=develop
add space between two lines
* test=develop
add space between lines
* test=develop
modify comment in nn.py
* update API.spec
* rerun ci
* test=develop
rerun ci
* change input shape
* run ci
* test=develop
run ci
* modify format of nn.py
* test=develop
update API.spec
* test=develop
fix API doc
* modify API comment
* test=develop
update API.spec
* test=develop
modify comment
* test=develop
modify comment
* test=develop
update API.spec
* test=develop
modify comment
* test=develop
add inference in nn.py
* test=develop
update API.spec
* test=develop
resolve conflict
* add unfold op
test=develop
* fix divide bug in python3 when calculating output width and height
test=develop
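A plausible illustration of the pitfall (the standard sliding-window output-size formula; not necessarily the exact patched line):

```python
# Python 2's `/` truncates ints; Python 3's `/` returns a float,
# which breaks code that uses the result as a tensor dimension.
in_size, padding, dilation, kernel, stride = 28, 1, 1, 3, 2

out_bad = (in_size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1
out_ok = (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(out_bad)  # 14.5 on Python 3
print(out_ok)   # 14 on both, usable as a shape
```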
* add name=None in python api, move redundant code into inline function
* try to trigger ci for this code
test=develop
Add Pipeline Concurrency Train Mode:
- Cpp: pipeline_trainer & section_worker
- Python: PipelineOptimizer
- Add a new data_feed type: PrivateInstantDataFeed
- Add a test demo of pipeline trainer and the test model is gnn
- Win32 is not supported for now
* Enable seq_pool op to accept len 0 input
test=develop
* Update sequence_pool's api
test=develop
* Add more unittest cases for seq_pool op
test=develop
* Remove legacy comments
test=develop
* Don't use template in op maker
test=develop
* test=develop, refine api
* test=develop, fix bug when an error occurred on save_persistable with no optimizer
* test=develop, refine warning
* test=develop, refine example code and comments
* save optimizer related vars in dygraph
* test=develop, add optimizer save and load
* test=develop, add optimizer save and load
* test=develop, merge code and add multi-optimizer save and load
* test=develop, fix test_imperative_checkpoint
* test=develop, fix include error
* test=develop, fix include error
* test=develop, renew api spec
* test=develop, refine code
* test=develop, set default value for checkpoint
* test=develop, fix ci error
* test=develop, change API.spec and make api more readable
* test=develop, refine version and time stamp
* test=develop, add example code and refine code
* test=develop, refine doc
* test=develop, change version
* for debug
* test=develop, memory optimize for dygraph using shared_ptr
* test=develop, fix error shown by travis ci
* test=develop, fix bug for recurrent usage of varbase
* test=develop, init varbase when it needs to be added
* test=develop, fix problem of recurrent gradient
* test=develop, add gradient test for recurrent varbase usage
* fix redundant code problem in prepare context, optimize executor by caching create_variables
test=develop
* cache sub_scope, program, var when use_program_cache=True is set
* make fetch_list runnable with variables, add more unittests for use_program_cache
* add gradient clip in minimize; test=develop
* fix bug; test=develop
* fix format; test=develop
* move new grad clip to dygraph/grad_clip.py; test=develop
* fix lr decay and grad clip test; test=develop
* separate dygraph grad clip; test=develop
* fix grad clip test; test=develop
* fix api spec bug; test=develop
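For reference, a generic sketch of clipping by global norm (the standard formula, not code lifted from this PR):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Scale every grad by clip_norm / max(global_norm, clip_norm),
    # so the joint norm never exceeds clip_norm.
    global_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads]
```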
* add blank line to fix format problem; test=develop,test=document_preview
* fuse mul and elementwise add to fc
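The algebraic identity the fuse relies on, in NumPy terms (illustrative, not the pass code): a mul op followed by an elementwise_add computes the same affine map as a single fc op.

```python
import numpy as np

x = np.random.rand(4, 8).astype('float32')    # input
w = np.random.rand(8, 16).astype('float32')   # mul weights
b = np.random.rand(16).astype('float32')      # elementwise_add bias

mul_out = np.dot(x, w)       # mul op
add_out = mul_out + b        # elementwise_add op
fc_out = np.dot(x, w) + b    # the single fused fc op

assert np.allclose(add_out, fc_out)
```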
* Reimplement the FC forward operator
* Fix FC MKLDNN integration by transposing weights
* Add FC MKLDNN Pass
test=develop
* FC MKLDNN Pass: change memcpy to std::copy
* Fix MKLDNN FC handling of mismatched input and weights dims
* Lower tolerance for MKL-DNN in resnet50 test
test=develop
* Adjust FC to support MKLDNN Op placement
test=develop
* Adjust Placement Op to set use_mkldnn attribute for graph
test=develop
* MKLDNN FC: fix weights format so that gemm version is called
test=develop
* FC MKLDNN: Remove tolerance decrease from tester_helper
* FC MKL-DNN: Refactor the code, change input reorder to weight reorder
* MKL-DNN FC: Introduce operator caching
test=develop
* FC MKL-DNN: Fix the tensor type in ExpectedKernelType
test=develop
* FC MKL-DNN: fix style changes
test=develop
* FC MKL-DNN: fallback to native on non-supported dim sizes
test=develop
* FC MKLDNN: fix CMake paths
test=develop
* FC MKLDNN: Refine placement pass graph mkldnn attribute
test=develop
* Fix Transpiler error for fuse_conv_eltwise
test=develop
* Fix missing STL includes in files
test=develop
* FC MKL-DNN: Enable new output size computation
Also, refine pass to comply with newest interface.
test=develop
* FC MKL-DNN: enable only when fc_mkldnn_pass is enabled
* FC MKL-DNN: Allow Weights to use oi or io format
* FC MKL-DNN: Adjust UT to work with correct dims
test=develop
* Enable MKL DEBUG for resnet50 analyzer
test=develop
* FC MKL-DNN: Improve Hashing function
test=develop
* FC MKL-DNN: Fix shape for fc weights in transpiler
* FC MKL-DNN: Update input pointer in re-used fc primitive
* Add log for not handling fc fuse for unsupported dims
test=develop
* FC MKL-DNN: Move transpose from pass to Op Kernel
test=develop
* FC MKL-DNN: Disable transpose in unit test
test=develop
* FC MKL-DNN: Remove fc_mkldnn_pass from default list
* Correct Flag for fake data analyzer tests
test=develop
* FC MKL-DNN: Add comment about fc mkldnn pass disablement
test=develop
* FC MKL-DNN: Disable fc in int8 tests
test=develop
* Relu6 is the bottleneck op for MobileNet-v2. Since MKL-DNN supports conv/relu6 fusion, we implement the fusion via a fuse pass. Int8 enabling for this fusion will be supported in MKLDNN v0.20, so this PR focuses on the fp32 optimization.
The table below shows the benchmark (FPS) measured on SKX-8180 (28 cores):
| Batch size | With fusion | Without fusion |
| -- | -- | -- |
| 1 | 214.7 | 53.4 |
| 50 | 1219.727 | 137.280 |
test=develop
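What the fused primitive computes, conceptually (a sketch, not the MKL-DNN code): relu6 is the conv output clamped to [0, 6], so conv + relu6 can run as one primitive instead of two ops.

```python
import numpy as np

conv_out = (np.random.randn(2, 8, 14, 14) * 4).astype('float32')
relu6_out = np.clip(conv_out, 0.0, 6.0)  # relu6(x) = min(max(x, 0), 6)
```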
* Fix the format issue
test=develop
* Add the missing nolint comments.
test=develop
* Fix the typos.
test=develop
* Register the conv_brelu_mkldnn_fuse_pass for the MKLDNN engine.
test=develop
* Adjust the indentation.
test=develop
* Add the test_conv_brelu_mkldnn_fuse_pass case.
test=develop
* Slightly update the code per Baidu's comments.
Embed the parameter definitions into the code; that will make the code easier to understand.
test=develop
* add double grad for elementwise_mul. test=develop
* remove comment. test=develop
* fix grad sum. test=develop
* fix for axis expand. test=develop
* add test for axis expand. test=develop
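The second-order relations being implemented, sketched numerically (standard calculus for elementwise z = x * y, not the op code): the first-order grads are dx = dz * y and dy = dz * x, so given the grads-of-grads ddx and ddy, the double-grad op emits ddz = ddx * y + ddy * x.

```python
import numpy as np

x, y = np.array([1., 2.]), np.array([3., 4.])
dz = np.array([1., 1.])          # upstream gradient of z = x * y

dx, dy = dz * y, dz * x          # first-order grads
ddx = np.array([0.1, 0.2])       # grad of dx
ddy = np.array([0.3, 0.4])       # grad of dy
ddz = ddx * y + ddy * x          # double-grad output
```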
* Add conv2d_grad_grad_op
* Extract the cuDNN conv algo searching code into conv_cudnn_helper.h.
- Now use it in conv2d_grad_grad.
- Will simplify the searching code in conv2d and conv2d_grad in a follow-up PR.
* Enhance and fix a bug in the unit tests of gradient_checker.
* Support fetching empty variables; return None in Python.
* add use_cuda to inplace pass, test=develop
* add softmax_with_xe_inplace test, test=develop
* fix potential inplace bug
test=develop
* add more skip vars in mem opt pass, test=develop
* follow comment, test=develop
* follow comments, move duplicate out arg check to program->graph, test=develop
* Add MovingAverageAbsMaxScale operator which is only used for calculating the quantization scale.
* test=develop
* change the output into inplace. test=develop
* Revert "test=develop"
This reverts commit 696cf62699ba1e1c98f61f7345ac7060010eb29a.
* Revert "change the output into inplace. test=develop"
This reverts commit a19acd20f07eee82622701a3015e6e9c073a5e0b.
* test=develop.
* update the MovingAverageAbsMaxScaleOp test. test=develop
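A hedged sketch of the idea (a simple EMA form is assumed here; not necessarily the exact update rule of MovingAverageAbsMaxScaleOp): track the quantization scale as a moving average of max(|x|).

```python
import numpy as np

def update_scale(scale, x, decay=0.9):
    # EMA of the tensor's max absolute value; the scale is later
    # used to map float activations into the int8 range.
    return decay * scale + (1.0 - decay) * float(np.max(np.abs(x)))

scale = 0.0
for _ in range(5):
    scale = update_scale(scale, np.random.randn(100).astype('float32'))
```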
* refine_dropout_mem,test=develop
* # This is a combination of 14 commits.
# The first commit's message is:
remove ut test_dist_word2vec in mac ci, will fix it in private, test=develop (#17066)
# This is the 2nd commit message:
Fleet unify distributed training (#16791)
* implement distributed transpiler with fleet
# This is the 3rd commit message:
ParallelDyGraph with GPU collective mode (#16827)
implement dygraph.parallel.DataParallel to hook reduce op.
# This is the 4th commit message:
Init mixed precision training interface (#16856)
* Init mixed precision training interface
* Add fp16 test script
test=develop
* All initializers support float16
test=develop
* Code cleanup & add more code annotations
test=develop
* Update API spec
test=develop
* Add usage example in doc
test=develop
# This is the 5th commit message:
fix reference_count_pass,test=develop (#17060)
test=develop
# This is the 6th commit message:
Speedup roi_perspective_transform op by caching the information of linear interpolation in forward (#17090)
* Cache the information of linear interpolation in forward and use it in backward.
test=develop
* Fix cuda kernel.
test=develop
# This is the 7th commit message:
remove unnecessary prepare_data (#17080)
test=develop
# This is the 8th commit message:
fix interpolate cu. test=develop (#17101)
# This is the 9th commit message:
test=develop, double backward leaky_relu (#17067)
backward of backward: leaky_relu
# This is the 10th commit message:
fix fuse optimizer ops (#17102)
test=develop
# This is the 11th commit message:
truncated_gaussian_random supported in distributed training, test=develop (#17091)
# This is the 12th commit message:
Detailed coordinate description for yolov3 loss (#17007)
* Detailed coordinate description for yolov3 loss
test=develop
* modified api.spec
test=develop
* modified loss name
* fix api.spec
test=develop
* polish description
test=develop
* modified api.spec
test=develop
# This is the 13th commit message:
fix test_weight_decay (#17109)
test=develop
# This is the 14th commit message:
Path flag (#17105)
* fix python/paddle/fluid/__init__.py detecting problems
* Init mixed precision training interface
* Add fp16 test script
test=develop
* All initializers support float16
test=develop
* Code cleanup & add more code annotations
test=develop
* Update API spec
test=develop
* Add usage example in doc
test=develop
* resolve #17057
Fixed the bug that the fuse_relu/fuse_residual options couldn't be passed to class TestConv2dInt8Op.
test=develop
* Fix the bug in the test_conv2d_int8_mkldnn case, which was caused by improper parameter passing.
test=develop
* Support backward of backward and a new gradient checker
* Rename decorators.py to decorator_helper.py, since Python on Windows CI has the decorators package.
1. Add ReluDoubleGradMaker when registering relu_grad.
2. Add a new gradient checker by comparing theoretical and numerical Jacobian. Check double gradients by double_grad_check.
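The core of such a checker, sketched generically (central finite differences vs. an analytic gradient; not the gradient_checker code):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    # Central-difference approximation of df/dx for scalar-valued f.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

f = lambda v: np.sum(np.maximum(v, 0.02 * v))  # leaky_relu, alpha=0.02
x = np.array([-1.5, -0.3, 0.4, 1.2, 2.0])      # kept away from the kink at 0
theoretical = np.where(x > 0, 1.0, 0.02)
assert np.allclose(numeric_grad(f, x), theoretical, atol=1e-4)
```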
* move gc test to op_test
test=develop
* Revert "move gc test to op_test"
This reverts commit cf15da65c38f57c91f53b3d8b3c2365d4aa86016.
* enable gc test in some ops
test=develop
* add parallel build script to ci test=develop
* 1. classify the test cases as single-card/two-card/multi-card types
2. run test cases according to the run type
Update the filter generation mechanism so that it can generate negative parameters (see the sketch below).
The original call (np.random.random()) couldn't simulate the conv/relu fusion case.
test=develop
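Why the original filters couldn't exercise the fusion path, sketched (one possible replacement shown; the actual test may differ): np.random.random() samples from [0, 1), so with non-negative inputs every conv output is non-negative and relu never clips anything.

```python
import numpy as np

filt_old = np.random.random((3, 3)).astype('float32')           # all in [0, 1)
filt_new = np.random.uniform(-1, 1, (3, 3)).astype('float32')   # signed values
```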
* speedup gc and inplace softmax_with_cross_entropy_grad
test=develop
* refine GPU memory usage of models
Merge skip vars and warning messages of mem opt
remove relu mem opt
test=develop
* follow comments
test=develop