Commit Graph

80 Commits (b085ecc25896c0a4aea70bcfff316683a76ec5e4)

Author SHA1 Message Date
gongweibao a5fc291fe5 Use 2 cards for hallreduce unit test. (#21085)
5 years ago
lilong12 53148e0696 modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802)
5 years ago
gongweibao e425124041 Wait pserver to complete initialization. (#20777)
5 years ago
gongweibao 8088395a84 Set unique port to every distribute test to avoid potential port conflicts (#20759)
5 years ago
WangXi 507afa8a8a Fix dgc nan by stripping nccl from sparseReduce. (#20630)
5 years ago
gongweibao c1710e91b2 Disable GRPC_ARG_ALLOW_REUSEPORT to avoid potencial problem. (#20690)
5 years ago
gongweibao f3f52fc1e2 Retry when failed to bind address. (#20642)
6 years ago
WangXi cadc6a9704 fix dgc test and bug when not set trainers_endpoints_, test=develop (#20617)
6 years ago
gongweibao bf6470c71e Add detail logs on resnet unit test (#20558)
6 years ago
gongweibao 89c4b3ddcf Add bash_test_modules function to capture the timeout or failed context. (#20197)
6 years ago
tangwei12 8f0b3c0516 the integrated communicator (#19849)
6 years ago
Yi Liu 4ef6b8457a adapte fleet api for localsgd and support nccl comm configuration in executor (#19443)
6 years ago
chengduo 5a579df9ba [Speedup] Make dygraph data parallel faster (#19280)
6 years ago
kh2se2013 27e85625b8 add python coverage launch when WITH_COVERAGE=ON (#19264)
6 years ago
gongweibao 29d8781240 Polish fleet API to support cuda collective mode and nccl2 mode. (#18966)
6 years ago
Zeng Jinle c194b0c835 Try to deprecate unstable python memory optimize (#18983)
6 years ago
chengduo 17d62ab220 Enhance fuse optimization op pass (#19010)
6 years ago
gongweibao c0a82748cf Polish backwards optimizer dependency codes and use more default values. (#18255)
6 years ago
guru4elephant 7d76e34ec2 add more print function for timeout issue, make timeout value larger (#18219)
6 years ago
guru4elephant 0941e3e013 add class name and timeline for test_dist_base.py (#18122)
6 years ago
guru4elephant b2cfdc3891 Refine unittest log (#18084)
6 years ago
gongweibao f5caf3443c Fix reinitialized ncclid error! (#18025)
6 years ago
gongweibao fbbdc9ccad Add backward and optimizer operator dependency pass. (#17746)
6 years ago
gongweibao 65bbf950ee Add multi-ncclcomm and 2D ncclallreduce support. (#17263)
6 years ago
Yan Xu 0217555530 polish parallel dygraph code (#17164)
6 years ago
Yan Xu 0b07eef118 ParallelDyGraph with GPU collective mode (#16827)
6 years ago
tangwei12 1a4a51db2b Fleet unify distributed training (#16791)
6 years ago
Qiao Longfei 61912e879d test_dist_base set runtime_split_send_recv to false test=develop
6 years ago
Qiao Longfei d8974e6da0 Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-ssa-graph-executor-communicator
6 years ago
gongweibao eb83abeac3 Add DGC(Deep Gradient Compression) interface. (#15841)
6 years ago
Qiao Longfei 30618409db Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-ssa-graph-executor-communicator
6 years ago
Wu Yi 8bebfe5640 add resnet nccl2 dist training, mp training unit test (#16167)
6 years ago
Wu Yi 6382b62f6b Collective ops (#15572)
6 years ago
liuwei1031 caadd0581d add IfElse test case for ir memory optimize (#15998)
6 years ago
Qiao Longfei 5cf0092825 add more log and fix test_dist_base in multi_batch_merge_pass
6 years ago
Qiao Longfei 4356f186b4 complete parameter_send
6 years ago
Zeng Jinle dec89bd7ed Merge pull request #15460 from sneaxiy/try_to_turn_on_remove_unnecessary_lock
6 years ago
tangwei12 8b50ad80ff checkpoint at distributed training (#14854)
6 years ago
sneaxiy ef788603d4 merge develop
6 years ago
WangZhen bac08c4a26 Fix some bugs caused by set functions of the Pass class. test=develop
6 years ago
sneaxiy d8568acd19 turn on remove_unnecessary_lock
6 years ago
Xin Pan 7526ac14e3 add comments
6 years ago
Xin Pan beaae61a16 polish
6 years ago
Xin Pan 5e928e579a try unify Executor and ParallelExecutor
6 years ago
Yancey1989 8cad371a60 fix nccl unittest acc test=develop
6 years ago
Yan Xu 5384206aec Merge pull request #14869 from Yancey1989/fix_dist_unittest
6 years ago
Yancey1989 fa1f77e20c enable ci test=develop
6 years ago
Wu Yi f95ee9c09f fix nccl dist test acc (#14867)
6 years ago
Wu Yi 554bcdbdfc add more log for dist test for ci test=develop (#14813)
6 years ago
Wu Yi aebc175cd4 add nccl2 dist tests (#14755)
6 years ago