Commit Graph

35 Commits (3a88acd2ee2fa46ac34da755fa49b7193e17a525)

Author SHA1 Message Date
tangwei12 202bfab1be
Feature/large scale kv save base/delta (#27470)
5 years ago
123malin a4f850748a
【paddle.fleet】bug fix for parameter_recv (#27838)
5 years ago
MRXLT 20fb01fb00
fix distributed error info (#27206)
5 years ago
wanghuancoder df43905f12
use iwyu clean include (#27267)
5 years ago
Chengmo 7f2aa2db3c
【paddle.fleet】Support Heter Parameter Server (#25998)
5 years ago
tangwei12 caa90a6510
Integrated Trainer of Parameter Server (API add `fluid.contrib.layers.sparse_embedding` only) (#22957)
5 years ago
tangwei12 4b3778a3ee
Revert/barrier for sync (#25417)
5 years ago
tangwei12 be6a315fbd
Fix/sync barrier (#25016)
5 years ago
Chen Weihang d1062d5278
Replace all errors thrown by LOG(FATAL) with PADDLE_THROW (#24759)
5 years ago
qingqing01 6162cf2f2e
Make optimizer consistent in dygraph and static-graph and remove some LOG-INFO. (#23426)
6 years ago
Wilber de009152a7
Compile without nccl deps. [2/2] (#22484)
6 years ago
123malin 985bceac53
Bug fix for sparse recorder (#21969)
6 years ago
123malin 20cdff0e02
Optimize decay (#20816)
6 years ago
gongweibao c1710e91b2
Disable GRPC_ARG_ALLOW_REUSEPORT to avoid potencial problem. (#20690)
6 years ago
gongweibao f3f52fc1e2
Retry when failed to bind address. (#20642)
6 years ago
123malin b4a3b75002
bug fix: invalid learning rate decay in pserver async mode (#20325)
6 years ago
tangwei12 b5a410466c
Trainer heartbeat for async mode (#19600)
6 years ago
123malin 1bc285a53a
add retry function to try to solve grpc error code 14 (#19661)
6 years ago
gongweibao 29d8781240
Polish fleet API to support cuda collective mode and nccl2 mode. (#18966)
6 years ago
Qiao Longfei 8b8a0487c7 fix compile test=develop
7 years ago
Qiao Longfei a541c25ab6 fix cpplint test=develop
7 years ago
Qiao Longfei 0608f8ca56 Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-async_sparse_param_update_recorder
7 years ago
gongweibao bf606bce8a
Fix grpc log message. (#16735)
7 years ago
Qiao Longfei 392e97aae5 fix cpplint test=develop
7 years ago
Qiao Longfei 0ff1e64fab fix a bug
7 years ago
Qiao Longfei 103c9bb376 update rpc_client
7 years ago
Qiao Longfei d5c7898201 complete pserver side update
7 years ago
Qiao Longfei 065b68b6ca clean code
7 years ago
Qiao Longfei 347178bd97 fix pserver memory leak
7 years ago
Qiao Longfei b8491bfd4e Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-communicator
7 years ago
Dun a83e470405
Profiler refine and add CUDA runtime api tracer (#15301)
7 years ago
Qiao Longfei be72940b76 Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add-communicator
7 years ago
tangwei12 8b50ad80ff
checkpoint at distributed training (#14854)
7 years ago
Qiao Longfei 9958775b31 add NewTmpScope to scope
7 years ago
Wu Yi a8bc05b5ff
Refactor distributed RPC (#15075)
7 years ago