gongweibao
89c4b3ddcf
Add bash_test_modules function to capture the timeout or failed context. ( #20197 )
6 years ago
tangwei12
8f0b3c0516
the integrated communicator ( #19849 )
...
* add a base class for the Communicator
* add AsyncCommunicator Impl for async distributed training
6 years ago
Yi Liu
4ef6b8457a
adapt fleet api for localsgd and support nccl comm configuration in executor ( #19443 )
...
test=develop
6 years ago
chengduo
5a579df9ba
[Speedup] Make dygraph data parallel faster ( #19280 )
...
* update parallel.py
test=develop
6 years ago
kh2se2013
27e85625b8
add python coverage launch when WITH_COVERAGE=ON ( #19264 )
...
add python coverage launch when WITH_COVERAGE=ON
6 years ago
gongweibao
29d8781240
Polish fleet API to support cuda collective mode and nccl2 mode. ( #18966 )
...
Polish fleet API to support cuda collective mode and nccl2 mode
6 years ago
Zeng Jinle
c194b0c835
Try to deprecate unstable python memory optimize ( #18983 )
...
* deprecate python memory optimize, test=develop
* remove memory_optimize in unittests, test=develop
* add unittests to deprecated interfaces, test=develop
6 years ago
chengduo
17d62ab220
Enhance fuse optimization op pass ( #19010 )
...
* Enhance fuse optimization op pass
test=develop
6 years ago
gongweibao
c0a82748cf
Polish backwards optimizer dependency codes and use more default values. ( #18255 )
6 years ago
guru4elephant
7d76e34ec2
add more print function for timeout issue, make timeout value larger ( #18219 )
...
* add more print function for timeout issue, make timeout value larger
6 years ago
guru4elephant
0941e3e013
add class name and timeline for test_dist_base.py ( #18122 )
...
* add class name and timeline for test_dist_base.py
6 years ago
guru4elephant
b2cfdc3891
Refine unittest log ( #18084 )
...
* add print log for unittest of distributed training
test=develop
6 years ago
gongweibao
f5caf3443c
Fix reinitialized ncclid error! ( #18025 )
6 years ago
gongweibao
fbbdc9ccad
Add backward and optimizer operator dependency pass. ( #17746 )
6 years ago
gongweibao
65bbf950ee
Add multi-ncclcomm and 2D ncclallreduce support. ( #17263 )
6 years ago
Yan Xu
0217555530
polish parallel dygraph code ( #17164 )
...
* add var grad hook test=develop
6 years ago
Yan Xu
0b07eef118
ParallelDyGraph with GPU collective mode ( #16827 )
...
implement dygraph.parallel.DataParallel to hook reduce op.
6 years ago
tangwei12
1a4a51db2b
Fleet unify distributed training ( #16791 )
...
* implement distributed transpiler with fleet
6 years ago
Qiao Longfei
61912e879d
test_dist_base set runtime_split_send_recv to false test=develop
6 years ago
Qiao Longfei
d8974e6da0
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-ssa-graph-executor-communicator
...
test=develop
6 years ago
gongweibao
eb83abeac3
Add DGC(Deep Gradient Compression) interface. ( #15841 )
6 years ago
Qiao Longfei
30618409db
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-ssa-graph-executor-communicator
6 years ago
Wu Yi
8bebfe5640
add resnet nccl2 dist training, mp training unit test ( #16167 )
...
* add resnet nccl2 test=develop
* test dist train test=develop
* update test=develop
* increase timeout test=develop
* test on CI env test=develop
6 years ago
Wu Yi
6382b62f6b
Collective ops ( #15572 )
...
* wip allreduce in op
* wip
* wip
* wip
* wip adding test
* wip for conflict with mp mode
* fix tests test=develop
* fix cpu build test=develop
* fix travis clang format test=develop
* fix cpu build test=develop
* update api.spec test=develop
* delete comment test=develop
* fix cpplint test=develop
* fix test=develop
* follow comment test=develop
* add file test=develop
* fix build test=develop
* update test=develop
* to be compatible with sync_bn, and fix mp mode in develop test=develop
6 years ago
liuwei1031
caadd0581d
add IfElse test case for ir memory optimize ( #15998 )
...
* add ir memory optimize test case for IfElse op, test=develop
* fix some unittest failures by forcing use of the python memory_optimize, test=develop
* tweak comments, test=develop
* fix unittest, test=develop
* fix unittest, test=develop
6 years ago
Qiao Longfei
5cf0092825
add more log and fix test_dist_base in multi_batch_merge_pass
6 years ago
Qiao Longfei
4356f186b4
complete parameter_send
6 years ago
Zeng Jinle
dec89bd7ed
Merge pull request #15460 from sneaxiy/try_to_turn_on_remove_unnecessary_lock
...
Turn on remove_unnecessary_lock by default
6 years ago
tangwei12
8b50ad80ff
checkpoint at distributed training ( #14854 )
...
checkpoint for distributed training.
6 years ago
sneaxiy
ef788603d4
merge develop
...
test=develop
6 years ago
WangZhen
bac08c4a26
Fix some bugs caused by set functions of the Pass class. test=develop
6 years ago
sneaxiy
d8568acd19
turn on remove_unnecessary_lock
...
test=develop
6 years ago
Xin Pan
7526ac14e3
add comments
...
test=develop
6 years ago
Xin Pan
beaae61a16
polish
...
test=develop
6 years ago
Xin Pan
5e928e579a
try unify Executor and ParallelExecutor
...
test=develop
6 years ago
Yancey1989
8cad371a60
fix nccl unittest acc test=develop
6 years ago
Yan Xu
5384206aec
Merge pull request #14869 from Yancey1989/fix_dist_unittest
...
fix dist unit test
6 years ago
Yancey1989
fa1f77e20c
enable ci test=develop
6 years ago
Wu Yi
f95ee9c09f
fix nccl dist test acc ( #14867 )
...
* fix nccl dist test acc test=develop
* fix test=develop
6 years ago
Wu Yi
554bcdbdfc
add more log for dist test for ci test=develop ( #14813 )
...
* add more log for dist test for ci test=develop
* increase deadline test=develop
6 years ago
Wu Yi
aebc175cd4
add nccl2 dist tests ( #14755 )
...
* add nccl2 dist tests test=develop
* fix dist_base test=develop
* fix tests test=develop
* fix test on mac test=develop
6 years ago
Wu Yi
e2011f1353
test dist ut fixes test=develop ( #14706 )
...
* test dist ut fixes test=develop
* fix cmake
* for test
6 years ago
Xin Pan
44ecf9a481
fix
...
test=develop
6 years ago
Xin Pan
9735e3016a
fix test
...
the build strategy is finalized after create_passes, so later
changes to the build strategy have no effect.
test=develop
6 years ago
Wu Yi
306236c2c0
feature/DC asgd ( #12722 )
...
* wip
* add ref_by_trainer_id op
* ready to test
* fix ref inputs
* refine rpc_op_handle
* fix merge bug
6 years ago
Wu Yi
d186e7434e
Refine dist ut ( #14118 )
...
* fix use_reader_alloc uts
* dist ut fixes test=develop
* update test=develop
* fix test for py3 test=develop
6 years ago
minqiyang
59420d5bd2
Polish code
...
test=develop
7 years ago
minqiyang
2cc939bbfa
Fix Mac Python3 CI job
...
test=develop
7 years ago
Wu Yi
26200f2e42
[1.1] [project] train imagenet using large batch size ( #13766 )
...
* fix nccl2 lars dist support
* put lars in momentum op
* add tests lars
* fix ci
* fix cpu kernel
* soft warning
* remove lars in test_recognize_digits.py
* move to another op
* add file
* update api.spec test=develop
* update test=develop
* fix api.spec test=develop
* wip
* wip, finish grad merge ops
* wip, finish graph build
* wip test running
* work on 1 gpu
* workable version
* update
* fix tests
* fuse broadcast op
* fix compile failed
* refine
* add batch merge test mnist
* fix CI test=develop
* fix build
* use independent bn params for batch merge test=develop
* update api.spec
* follow comments and for test
* wip
* refine tests test=develop
* follow comments test=develop
* remove startup bn modify test=develop
* follow comments test=develop
* fix merge test=develop
7 years ago
tangwei12
b35239df2b
fix dist ut with place, test=develop ( #13647 )
7 years ago