Commit Graph

165 Commits (0073f9bdb0b43a8d298346e28a2b403fe351bac3)

Author  SHA1  Message  Date

tianshuo78520a  d2ba91aad1  fix typo words (#22653)  5 years ago
tangwei12  66a3150135  SYNC with communicaotor (#22344)  5 years ago
123malin  00594c1c88  support dumping params/grads in transpiler mode (#22490)  5 years ago
123malin  e59463efc7  test=develop, add distributed tools (#22623)  5 years ago
tangwei12  1aab3e61c9  add texttable for pretty flag output (#22584)  5 years ago
tangwei12  b0675c8193  fix bug with compiledProgram (#22495)  5 years ago
yaoxuefeng  2235ee1a5e  multi-loss optimization by adding a DownpourOpt worker (#22025)  5 years ago
xujiaqi01  6e4f39a061  add hdfs ls retry time and sleep time, fix save inference (#22433)  5 years ago
tangwei12  7e2665c58b  fix bug with half (#22378)  5 years ago
xujiaqi01  371f377bea  add GeneralRoleMaker (#22295)  5 years ago
tangwei12  82bc814a57  integrated HALF_ASYNC to communicator (#21869)  5 years ago
123malin  7fb817d447  add distributed_strategy (#21710)  6 years ago
WangXi  3ec289a6a3  fix sync_batch_norm hang in fleet (#21838)  6 years ago
lilong12  da75ac8b6c  bugfix: construct a DistributedStrategy instance if the passed one is None (#21545)  6 years ago
xujiaqi01  f1178e9d79  fix fleet save bug (#21362)  6 years ago
Zhen Wang  be2e3e67d9  Fix some typos in AMP. (#21354)  6 years ago
Thunderbrook  9a7832f8be  print table stat info for pslib (#21296)  6 years ago
xujiaqi01  319d2ba925  fix fs_client_param bug (#21212)  6 years ago
Thunderbrook  0d17c1b816  solve pslib core in stop worker (#21263)  6 years ago
xujiaqi01  eca66f317e  fix fleet util bug (#21254)  6 years ago
Thunderbrook  349e82d669  support general embedding params (#21217)  6 years ago
Dong Daxiang  ccbdd7aad0  update worker_num for MPISymetricRoleMaker (#20798)  6 years ago
xujiaqi01  23876de55b  fix cache table bug, add save_paddle_inference_model, fix hdfs util bug (#21052)  6 years ago
xujiaqi01  9e045170c0  add copy table (#21086)  6 years ago
lilong12  53148e0696  modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802)  6 years ago
Thunderbrook  5970e8ac5e  find lookup table in order (#20932)  6 years ago
Chengmo  16596f6498  Fix Paddle Cloud role maker (#20860)  6 years ago
Bai Yifan  ac87d4e6e1  fix hdfs.download, test=develop (#20907)  6 years ago
Thunderbrook  59bcdc8a19  support dump param of model into afs (#20302)  6 years ago
xujiaqi01  48669aa8f0  fix several sparse table issuses (#20686)  6 years ago
xujiaqi01  5223b0dd9d  add check nan / inf in downpour worker (#20694)  6 years ago
Chengmo  940c6ff1c8  Fix communicator slow bug & fix communicator stop bug (#20366)  6 years ago
WangXi  cadc6a9704  fix dgc test and bug when not set trainers_endpoints_, test=develop (#20617)  6 years ago
mapingshuo  f55d1c6867  Fleet: deal with special case: strategy is None (#20359)  6 years ago
Thunderbrook  f76a32df4a  dump fix dov vec file num (#20539)  6 years ago
zhang wenhui  b521992041  fix converter , test=develop (#20522)  6 years ago
zhang wenhui  b82e6520e1  fix pslib datanorm double bug (#20297)  6 years ago
zhang wenhui  b28d4a824f  fix fleet_desc delete_after_unseen_day bug in node.py (#20091)  6 years ago
Chengmo  728ec1b43d  Add GEO-SGD distribute training algorithm (#20018)  6 years ago
xujiaqi01  cedc04775c  support change shuffle and train thread num (#19841)  6 years ago
mapingshuo  9901f69677  Forward recompute3 (#19913)  6 years ago
tangwei12  278dd00322  paddle cloud role maker fix (#19646)  6 years ago
gongweibao  e8d3745c0f  change _origin_program test=develop (#19863)  6 years ago
xujiaqi01  6bf298bf09  support preload thread, optimize hdfs log, fix master+patch bug (#19695)  6 years ago
gongweibao  6c2bc29cc0  Fix float16 optimizer. (#19682)  6 years ago
123malin  a25a716e87  Optimize fleet API: add input check for some interfaces (#18971)  6 years ago
123malin  2f037c3189  fix the diff between async mode and async_half mode (#19535)  6 years ago
yaoxuefeng  10ca3f9609  add thread scope stat accurate metrics test=develop (#19480)  6 years ago
Thunderbrook  1fe468d319  support debug each output of each ins (#19004)  6 years ago
zhang wenhui  bd35a7f0a6  support fc sort by number, test=develop (#19466)  6 years ago
Yi Liu  4ef6b8457a  adapte fleet api for localsgd and support nccl comm configuration in executor (#19443)  6 years ago
tangwei12  65c7368400  Fix the correctness of async mode at distributed training (#18863)  6 years ago
zhang wenhui  0d7949831b  fix fleet_desc bug && support format for abacus hotstart (#19430)  6 years ago
zhang wenhui  4a3c4b8fa4  add fleet_desc config feature & multi_sparse table, test=develop (#18827)  6 years ago
gongweibao  86f0591175  Remove node_num function. (#19167)  6 years ago
jiaqi  b86be13c15  fix default value (#19193)  6 years ago
jiaqi  b104ea0684  add get_last_save_xbox_base/get_last_save_xbox (#19122)  6 years ago
jiaqi  bfd514c730  fix default value of fleet desc (#19176)  6 years ago
gongweibao  29d8781240  Polish fleet API to support cuda collective mode and nccl2 mode. (#18966)  6 years ago
yaoxuefeng  9150cf50fc  add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871)  6 years ago
jiaqi  a99bc64c63  add fleet util, add some interface in hdfs util (#18752)  6 years ago
jiaqi  02c370c3dc  support filelist size < trainer num && fix pull dense (#18956)  6 years ago
jiaqi  768059b3a0  adjust ins weight according to nid slot (#18784)  6 years ago
jiaqi  233746d89d  set fleet_send_batch_num a default value according to trainer num  6 years ago
Thunderbrook  52c1431eee  add clear_model interface in fleetwrapper (#18815)  6 years ago
guru4elephant  30562e371b  refine launch_ps and role_maker (#18795)  6 years ago
fuyinno4  c167a4b4dd  Fix shrink-dense and add scale-datanorm (#18746)  6 years ago
Thunderbrook  d8396281ef  add slot to sparse table (#18686)  6 years ago
jiaqi  d18aabb472  support patch data, add load_one_table, fix bug (#18509)  6 years ago
tangwei12  d845848341  do some odd jobs (#18641)  6 years ago
guru4elephant  9c17a899d7  upgrade collective fleet api (#18533)  6 years ago
guru4elephant  1f1cc2221f  add random port (#18504)  6 years ago
guru4elephant  357311fdb7  make fleet support mpi job submit directly (#18441)  6 years ago
tangwei12  999d9a59a5  fix communicator with pyreader (#18350)  6 years ago
HaoRen  b7128bac5f  supports collective communicated training (#18175)  6 years ago
guru4elephant  ff399fd720  fix paddle cloud role maker bug (#18269)  6 years ago
Qiao Longfei  23f8a4b1c3  assign role_maker before use (#18137)  6 years ago
guru4elephant  58f3e1bad7  add paddle cloud role maker for customized usage, note this is only for industrial users that have cloud environment pre-configuration (#18121)  6 years ago
tangwei12  4c735f24ea  fix bug in fleet, test=develop (#18058)  6 years ago
tangwei12  101f74cb19  fix save/load in fleet (#17675)  6 years ago
Kaipeng Deng  96ee528e3e  fix logging basicConfig cannot be setting after import paddle (#17786)  6 years ago
lilong12  b5c35ae3e7  add UserDefinedCollectiveRoleMaker for collective mode (#17898)  6 years ago
Qiao Longfei  58f7695ab2  Async exe support communicator (#17386)  6 years ago
jiaqi  05df39ac06  support sparse table get shard_num from TableParameter (#17443)  6 years ago
jiaqi  34369944f5  support config file, cvm, load, save, shrink (#17319)  6 years ago
tangwei12  565d309501  Reformat fleet API (#17135)  6 years ago
tangwei12  1a4a51db2b  Fleet unify distributed training (#16791)  6 years ago
jiaqi  7968887fae  Merge branch 'develop' into dataset_merge_develop  6 years ago
dongdaxiang  ceac9df87a  fix code style for incubator  6 years ago
xjqbest  e784884e70  add Example in doc string of split_filelist  6 years ago
xjqbest  1c0ef929f9  fix code style  6 years ago
xujiaqi01  1938132936  fix code style  6 years ago
xjqbest  d5ee580c5c  move split filelist from trainer.py to fleet & fix error  6 years ago
xjqbest  126d2a2f9d  fix init_worker bug  6 years ago
xjqbest  7a759d76cd  fix code style  6 years ago
xjqbest  5e5139283b  fix runtime error  6 years ago
xjqbest  a99c8d0c29  fix client to client communication bug  6 years ago
xjqbest  a38b98cb32  fix code style & runtime error  6 years ago
dongdaxiang  17790188d0  make role maker and distributed optimizer private  6 years ago
xjqbest  d52586a97d  add doc string  6 years ago