Commit Graph

165 Commits (0073f9bdb0b43a8d298346e28a2b403fe351bac3)

Author  SHA1  Message  Date

tianshuo78520a  d2ba91aad1  fix typo words (#22653)  5 years ago
tangwei12  66a3150135  SYNC with communicaotor (#22344)  5 years ago
123malin  00594c1c88  support dumping params/grads in transpiler mode (#22490)  5 years ago
123malin  e59463efc7  test=develop, add distributed tools (#22623)  5 years ago
tangwei12  1aab3e61c9  add texttable for pretty flag output (#22584)  5 years ago
tangwei12  b0675c8193  fix bug with compiledProgram (#22495)  5 years ago
yaoxuefeng  2235ee1a5e  multi-loss optimization by adding a DownpourOpt worker (#22025)  5 years ago
xujiaqi01  6e4f39a061  add hdfs ls retry time and sleep time, fix save inference (#22433)  5 years ago
tangwei12  7e2665c58b  fix bug with half (#22378)  5 years ago
xujiaqi01  371f377bea  add GeneralRoleMaker (#22295)  5 years ago
tangwei12  82bc814a57  integrated HALF_ASYNC to communicator (#21869)  5 years ago
123malin  7fb817d447  add distributed_strategy (#21710)  6 years ago
WangXi  3ec289a6a3  fix sync_batch_norm hang in fleet (#21838)  6 years ago
lilong12  da75ac8b6c  bugfix: construct a DistributedStrategy instance if the passed one is None (#21545)  6 years ago
xujiaqi01  f1178e9d79  fix fleet save bug (#21362)  6 years ago
Zhen Wang  be2e3e67d9  Fix some typos in AMP. (#21354)  6 years ago
Thunderbrook  9a7832f8be  print table stat info for pslib (#21296)  6 years ago
xujiaqi01  319d2ba925  fix fs_client_param bug (#21212)  6 years ago
Thunderbrook  0d17c1b816  solve pslib core in stop worker (#21263)  6 years ago
xujiaqi01  eca66f317e  fix fleet util bug (#21254)  6 years ago
Thunderbrook  349e82d669  support general embedding params (#21217)  6 years ago
Dong Daxiang  ccbdd7aad0  update worker_num for MPISymetricRoleMaker (#20798)  6 years ago
xujiaqi01  23876de55b  fix cache table bug, add save_paddle_inference_model, fix hdfs util bug (#21052)  6 years ago
xujiaqi01  9e045170c0  add copy table (#21086)  6 years ago
lilong12  53148e0696  modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802)  6 years ago
Thunderbrook  5970e8ac5e  find lookup table in order (#20932)  6 years ago
Chengmo  16596f6498  Fix Paddle Cloud role maker (#20860)  6 years ago
Bai Yifan  ac87d4e6e1  fix hdfs.download, test=develop (#20907)  6 years ago
Thunderbrook  59bcdc8a19  support dump param of model into afs (#20302)  6 years ago
xujiaqi01  48669aa8f0  fix several sparse table issuses (#20686)  6 years ago
xujiaqi01  5223b0dd9d  add check nan / inf in downpour worker (#20694)  6 years ago
Chengmo  940c6ff1c8  Fix communicator slow bug & fix communicator stop bug (#20366)  6 years ago
WangXi  cadc6a9704  fix dgc test and bug when not set trainers_endpoints_, test=develop (#20617)  6 years ago
mapingshuo  f55d1c6867  Fleet: deal with special case: strategy is None (#20359)  6 years ago
Thunderbrook  f76a32df4a  dump fix dov vec file num (#20539)  6 years ago
zhang wenhui  b521992041  fix converter , test=develop (#20522)  6 years ago
zhang wenhui  b82e6520e1  fix pslib datanorm double bug (#20297)  6 years ago
zhang wenhui  b28d4a824f  fix fleet_desc delete_after_unseen_day bug in node.py (#20091)  6 years ago
Chengmo  728ec1b43d  Add GEO-SGD distribute training algorithm (#20018)  6 years ago
xujiaqi01  cedc04775c  support change shuffle and train thread num (#19841)  6 years ago
mapingshuo  9901f69677  Forward recompute3 (#19913)  6 years ago
tangwei12  278dd00322  paddle cloud role maker fix (#19646)  6 years ago
gongweibao  e8d3745c0f  change _origin_program test=develop (#19863)  6 years ago
xujiaqi01  6bf298bf09  support preload thread, optimize hdfs log, fix master+patch bug (#19695)  6 years ago
gongweibao  6c2bc29cc0  Fix float16 optimizer. (#19682)  6 years ago
123malin  a25a716e87  Optimize fleet API: add input check for some interfaces (#18971)  6 years ago
123malin  2f037c3189  fix the diff between async mode and async_half mode (#19535)  6 years ago
yaoxuefeng  10ca3f9609  add thread scope stat accurate metrics test=develop (#19480)  6 years ago
Thunderbrook  1fe468d319  support debug each output of each ins (#19004)  6 years ago
zhang wenhui  bd35a7f0a6  support fc sort by number, test=develop (#19466)  6 years ago
Yi Liu  4ef6b8457a  adapte fleet api for localsgd and support nccl comm configuration in executor (#19443)  6 years ago
tangwei12  65c7368400  Fix the correctness of async mode at distributed training (#18863)  6 years ago
zhang wenhui  0d7949831b  fix fleet_desc bug && support format for abacus hotstart (#19430)  6 years ago
zhang wenhui  4a3c4b8fa4  add fleet_desc config feature & multi_sparse table, test=develop (#18827)  6 years ago
gongweibao  86f0591175  Remove node_num function. (#19167)  6 years ago
jiaqi  b86be13c15  fix default value (#19193)  6 years ago
jiaqi  b104ea0684  add get_last_save_xbox_base/get_last_save_xbox (#19122)  6 years ago
jiaqi  bfd514c730  fix default value of fleet desc (#19176)  6 years ago
gongweibao  29d8781240  Polish fleet API to support cuda collective mode and nccl2 mode. (#18966)  6 years ago
yaoxuefeng  9150cf50fc  add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871)  6 years ago
jiaqi  a99bc64c63  add fleet util, add some interface in hdfs util (#18752)  6 years ago
jiaqi  02c370c3dc  support filelist size < trainer num && fix pull dense (#18956)  6 years ago
jiaqi  768059b3a0  adjust ins weight according to nid slot (#18784)  6 years ago
jiaqi  233746d89d  set fleet_send_batch_num a default value according to trainer num  6 years ago
Thunderbrook  52c1431eee  add clear_model interface in fleetwrapper (#18815)  6 years ago
guru4elephant  30562e371b  refine launch_ps and role_maker (#18795)  6 years ago
fuyinno4  c167a4b4dd  Fix shrink-dense and add scale-datanorm (#18746)  6 years ago
Thunderbrook  d8396281ef  add slot to sparse table (#18686)  6 years ago
jiaqi  d18aabb472  support patch data, add load_one_table, fix bug (#18509)  6 years ago
tangwei12  d845848341  do some odd jobs (#18641)  6 years ago
guru4elephant  9c17a899d7  upgrade collective fleet api (#18533)  6 years ago
guru4elephant  1f1cc2221f  add random port (#18504)  6 years ago
guru4elephant  357311fdb7  make fleet support mpi job submit directly (#18441)  6 years ago
tangwei12  999d9a59a5  fix communicator with pyreader (#18350)  6 years ago
HaoRen  b7128bac5f  supports collective communicated training (#18175)  6 years ago
guru4elephant  ff399fd720  fix paddle cloud role maker bug (#18269)  6 years ago
Qiao Longfei  23f8a4b1c3  assign role_maker before use (#18137)  6 years ago
guru4elephant  58f3e1bad7  add paddle cloud role maker for customized usage, note this is only for industrial users that have cloud environment pre-configuration (#18121)  6 years ago
tangwei12  4c735f24ea  fix bug in fleet, test=develop (#18058)  6 years ago
tangwei12  101f74cb19  fix save/load in fleet (#17675)  6 years ago
Kaipeng Deng  96ee528e3e  fix logging basicConfig cannot be setting after import paddle (#17786)  6 years ago
lilong12  b5c35ae3e7  add UserDefinedCollectiveRoleMaker for collective mode (#17898)  6 years ago
Qiao Longfei  58f7695ab2  Async exe support communicator (#17386)  6 years ago
jiaqi  05df39ac06  support sparse table get shard_num from TableParameter (#17443)  6 years ago
jiaqi  34369944f5  support config file, cvm, load, save, shrink (#17319)  6 years ago
tangwei12  565d309501  Reformat fleet API (#17135)  6 years ago
tangwei12  1a4a51db2b  Fleet unify distributed training (#16791)  6 years ago
jiaqi  7968887fae  Merge branch 'develop' into dataset_merge_develop  6 years ago
dongdaxiang  ceac9df87a  fix code style for incubator  6 years ago
xjqbest  e784884e70  add Example in doc string of split_filelist  6 years ago
xjqbest  1c0ef929f9  fix code style  6 years ago
xujiaqi01  1938132936  fix code style  6 years ago
xjqbest  d5ee580c5c  move split filelist from trainer.py to fleet & fix error  6 years ago
xjqbest  126d2a2f9d  fix init_worker bug  6 years ago
xjqbest  7a759d76cd  fix code style  6 years ago
xjqbest  5e5139283b  fix runtime error  6 years ago
xjqbest  a99c8d0c29  fix client to client communication bug  6 years ago
xjqbest  a38b98cb32  fix code style & runtime error  6 years ago
dongdaxiang  17790188d0  make role maker and distributed optimizer private  6 years ago
xjqbest  d52586a97d  add doc string  6 years ago