Commit Graph

169 Commits (9ebf05b003ab910bac2636496ef89d43927b7e60)

SHA1        Author         Date         Message
9ebf05b003  liuyuhui       4 years ago  [Kunlun] Multi xpu dygraph performance optimization, add distributed.spawn support for multi xpu and some bug-fixes (#31130)
d1075df2e8  danleifeng     4 years ago  topo and memory performance for heterps (#30440)
dc8dfba35b  lilong12       4 years ago  align the default value of some configuration for fleet to that of single cards (#30740)
ebbdf52557  tangwei12      4 years ago  fix entry (#31079)
16b4260b2f  123malin       4 years ago  test=develop, save/load, shrink (#30625)
4a8b8b4547  liuyuhui       4 years ago  [Kunlun] add gen_bkcl_id_op, support multi XPU cards training using multiprocess (#30858)
b1026f64af  WangXi         4 years ago  [kunlun] dygraph supports multi xpu card training (#30671)
31ed9c9eed  WangXi         4 years ago  Fleet distributed strategy support pure fp16 (#30754)
4a9de931a2  Zhen Wang      4 years ago  Fix the bug in fleet amp_init. (#30606)
138620084c  huangxu96      4 years ago  Add fleet amp_init() (#30572)
8126a41d73  lilong12       4 years ago  fix the bug of all_reduce pipeline gradient multiple times (#30437)
c9e78a22c5  tangwei12      4 years ago  add trainers for pserver (#30523)
9fec1618d2  hutuxian       4 years ago  Ascend Framework Part3: Ascend Parser (#30391)
05f06d9ae1  123malin       4 years ago  test=develop, fix fleet.metric (#30438)
859431aadb  Chengmo        4 years ago  fix ps init (#30397)
2a98e9323a  123malin       4 years ago  test=develop, add distributed_infer (#30300)
75936d838f  JZ-LIANG       4 years ago  Recompute Offload (#30233)
25f80fd304  tangwei12      4 years ago  Fix/distributed proto (#29981)
d479ae1725  Chengmo        4 years ago  [Paddle.Fleet] Support local save sparse param (#30175)
3016ba852e  Chen Weihang   4 years ago  remove distributed prepare context (#30219)
528e03fc08  Chengmo        4 years ago  [Paddle.Fleet] Fix tensor table (#30075)
8020e34e7c  Chen Weihang   4 years ago  Simplify the options of spawn based on fleetrun (#30144)
4d2a4bb27a  gongweibao     4 years ago  fix logs info test=develop (#30071)
ab04997846  WangXi         4 years ago  [fleet] combine amp and gradient merge, test=develop (#30086)
eea7090c26  gongweibao     4 years ago  fix selected_gpus test=develop (#30044)
46c4695421  Chen Weihang   4 years ago  Set FLAGS_selected_gpus for spawn (#29962)
b0bd93de00  lilong12       5 years ago  Disable gloo by default (#29805)
2bc5121da8  lilong12       5 years ago  add the paddle.distributed.split api (#29970)
01950ceb42  lilong12       5 years ago  fix the bug in pipeline data parallelism (#29731)
032414ca2a  tangwei12      5 years ago  [Feature] one ps (3/4) (#29604)
01e2874a0e  ShenLiang      5 years ago  Support multi-stream communication for dynamic graph distributed (#29525)
9cbcc6cadc  WangXi         5 years ago  fleet sync build strategy, test=develop (#29732)
d33d468f02  JZ-LIANG       5 years ago  [Sharding] add hybrid-dp feature (#29518)
2ef9e0e23c  ShenLiang      5 years ago  Rebuild group automatically in dynamic graph distributed (#29255)
b122d0bb76  lilong12       5 years ago  Fix bug in gloo that gloo initialization hangs (#29447)
4064354a01  ShenLiang      5 years ago  support dp run single card (#29358)
96de8b008f  gongweibao     5 years ago  cleanup enum test=develop (#29294)
2d6aa1a5bb  ShenLiang      5 years ago  fix warning of fleet (#29317)
2cd0bf5764  ShenLiang      5 years ago  Fix doc of fleet api (#29282)
46b73e6cd9  ShenLiang      5 years ago  Change the api of DataParallel and Fleet (#29224)
cc9c619679  123malin       5 years ago  test=develop, fix doc (#29200)
0c2a51d240  WangXi         5 years ago  optimizer amp, all use fp16 communication, overlap last comm and compute (#28957)
92817f8005  123malin       5 years ago  test=develop, rm pathlib (#28658)
e2d01eb650  ShenLiang      5 years ago  Support dynamic graph distributed (#28997)
d576d6ddeb  Chen Long      5 years ago  fix some docs test=develop;test=document_fix (#29159)
216e085605  lilong12       5 years ago  update, test=develop (#29139)
a1add716bc  lilong12       5 years ago  Add a flag to control whether to initialize gloo (#29150)
cddc70964d  ShenLiang      5 years ago  fix InMemoryDataset doc (#28688)
0dadacc4eb  JZ-LIANG       5 years ago  [sharding] doc, api, bug fixed (#28983)
2a864c70c4  lilong12       5 years ago  fix the bug in gloo (#29112)