Commit Graph

97 Commits (adaec0073d02c0ea55bcabc4671ebfc8dbd3182c)

Author | SHA1 | Message | Date
ShenLiang | 9401173e3a | Remove scale loss before reduce in dygraph (#30807) | 5 years ago
liuyuhui | 4a8b8b4547 | [Kunlun] add gen_bkcl_id_op, support multi XPU cards training using multiprocess (#30858) | 5 years ago
WangXi | b1026f64af | [kunlun] dygraph supports multi xpu card training (#30671) | 5 years ago
WangXi | ee16006b5d | Optimization grad merge performance (#29784) | 5 years ago
WangXi | 467c716963 | gen nccl id use socket (#29431) | 5 years ago
lilong12 | f77a78cdee | enable pipeline to run with Executor.run() (#28373) | 5 years ago
Chen Weihang | dec53a9c79 | Remove DataParallel.scale_loss & apply_collective_grads (#27603) | 6 years ago
lilong12 | bbc2add703 | Initialize gloo for low level collective apis (#27672) | 6 years ago
lilong12 | 36c0410223 | Revert "Initialize gloo for low level collective apis (#27356)", test=document_fix (#27665) | 6 years ago
lilong12 | fa73e4a284 | Initialize gloo for low level collective apis (#27356) | 6 years ago
danleifeng | 6b4ca0d7f1 | [paddle.fleet] distributed_optimizer supports dygraph (#26541) | 6 years ago
Chen Weihang | 31f422ae5e | Add interface to launch parallel dygraph by multiprocessing (#26044) | 6 years ago
tangwei12 | 4b3778a3ee | Revert/barrier for sync (#25417) | 6 years ago
tangwei12 | 9825a9f3ca | disable distributed UT temporary (#25300) | 6 years ago
WangXi | 8d47162e03 | Close fuse when use dgc & move DGC strategy from PE to compiler, test=develop (#22914) | 6 years ago
WangXi | 3ec289a6a3 | fix sync_batch_norm hang in fleet (#21838) | 6 years ago
WangXi | a2175cfc96 | Tmp fix fleet bug in py35 gcc8 CI, test=develop (#21703) | 6 years ago
gongweibao | a5fc291fe5 | Use 2 cards for hallreduce unit test. (#21085) | 7 years ago
lilong12 | 53148e0696 | modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802) | 7 years ago
gongweibao | e425124041 | Wait pserver to complete initialization. (#20777) | 7 years ago
gongweibao | 8088395a84 | Set unique port to every distribute test to avoid potential port conflicts (#20759) | 7 years ago
WangXi | 507afa8a8a | Fix dgc nan by stripping nccl from sparseReduce. (#20630) | 7 years ago
gongweibao | c1710e91b2 | Disable GRPC_ARG_ALLOW_REUSEPORT to avoid potencial problem. (#20690) | 7 years ago
gongweibao | f3f52fc1e2 | Retry when failed to bind address. (#20642) | 7 years ago
WangXi | cadc6a9704 | fix dgc test and bug when not set trainers_endpoints_, test=develop (#20617) | 7 years ago
gongweibao | bf6470c71e | Add detail logs on resnet unit test (#20558) | 7 years ago
gongweibao | 89c4b3ddcf | Add bash_test_modules function to capture the timeout or failed context. (#20197) | 7 years ago
tangwei12 | 8f0b3c0516 | the integrated communicator (#19849) | 7 years ago
Yi Liu | 4ef6b8457a | adapte fleet api for localsgd and support nccl comm configuration in executor (#19443) | 7 years ago
chengduo | 5a579df9ba | [Speedup] Make dygraph data parallel faster (#19280) | 7 years ago
kh2se2013 | 27e85625b8 | add python coverage launch when WITH_COVERAGE=ON (#19264) | 7 years ago
gongweibao | 29d8781240 | Polish fleet API to support cuda collective mode and nccl2 mode. (#18966) | 7 years ago
Zeng Jinle | c194b0c835 | Try to deprecate unstable python memory optimize (#18983) | 7 years ago
chengduo | 17d62ab220 | Enhance fuse optimization op pass (#19010) | 7 years ago
gongweibao | c0a82748cf | Polish backwards optimizer dependency codes and use more default values. (#18255) | 7 years ago
guru4elephant | 7d76e34ec2 | add more print function for timeout issue, make timeout value larger (#18219) | 7 years ago
guru4elephant | 0941e3e013 | add class name and timeline for test_dist_base.py (#18122) | 7 years ago
guru4elephant | b2cfdc3891 | Refine unittest log (#18084) | 7 years ago
gongweibao | f5caf3443c | Fix reinitialized ncclid error! (#18025) | 7 years ago
gongweibao | fbbdc9ccad | Add backward and optimizer operator dependency pass. (#17746) | 7 years ago
gongweibao | 65bbf950ee | Add multi-ncclcomm and 2D ncclallreduce support. (#17263) | 7 years ago
Yan Xu | 0217555530 | polish parallel dygraph code (#17164) | 7 years ago
Yan Xu | 0b07eef118 | ParallelDyGraph with GPU collective mode (#16827) | 7 years ago
tangwei12 | 1a4a51db2b | Fleet unify distributed training (#16791) | 7 years ago
Qiao Longfei | 61912e879d | test_dist_base set runtime_split_send_recv to false test=develop | 7 years ago
Qiao Longfei | d8974e6da0 | Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-ssa-graph-executor-communicator | 7 years ago
gongweibao | eb83abeac3 | Add DGC(Deep Gradient Compression) interface. (#15841) | 7 years ago
Qiao Longfei | 30618409db | Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async-ssa-graph-executor-communicator | 7 years ago
Wu Yi | 8bebfe5640 | add resnet nccl2 dist training, mp training unit test (#16167) | 7 years ago
Wu Yi | 6382b62f6b | Collective ops (#15572) | 7 years ago