* test=develop
Add the unique_with_counts op, which computes the unique elements of the input data and outputs the corresponding indices and counts.
* test=develop
Check the input and dtype in the unique_with_counts op
* test=develop
test=document_preview
update API.spec for `unique_with_counts` and, at the same time, optimize the Python API of the `unique_with_counts` op
* test=develop
test=document_preview
Fix some Python API problems in the `unique_with_counts` op, and change the error message of this op.
* Fix some API problems in the `unique_with_counts` op
test=develop
test=document_preview
* test=develop
test=document_preview
Fix the API sample of the `unique_with_counts` op, and update API.spec
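For reference, the semantics described in the entries above are close to numpy's `np.unique`; the sketch below is illustrative only, and whether the Paddle op sorts the unique values or keeps first-occurrence order (and its exact Python signature) is not restated here.

```python
import numpy as np

# Illustrative only: np.unique mirrors the three described outputs of unique_with_counts.
x = np.array([2, 3, 3, 1, 5, 3], dtype=np.int32)
out, index, count = np.unique(x, return_inverse=True, return_counts=True)
# out   -> [1 2 3 5]       unique values of the input
# index -> [1 2 2 0 3 2]   position of each input element within `out`
# count -> [1 1 3 1]       number of occurrences of each unique value
```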
(1) set fleet_send_batch_num to a default value based on the trainer count; the previous value of 80000 was fixed, and if the trainer count is much smaller or larger than 100, global shuffle may hit a timeout error.
(2) fix a load_one_table bug, add a barrier
* support center loss
* change the tensor copy API to the high-level TensorCopy API
* test=develop rewrite the center_loss CUDA kernel to make it faster,
add documentation for the center loss API, and update the test function
* test=document_preview test=develop
update document of center loss
* test=document_preview test=develop
modify API.spec, modify test code, and remove an unused const_cast
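As background for the center loss API documented above, here is a conceptual numpy sketch of center loss; it is not the Paddle kernel or its exact signature, and the update rule uses a simplified form without per-class count normalization.

```python
import numpy as np

def center_loss_step(features, labels, centers, alpha=0.5):
    # Penalize each sample's squared distance to its class center and move the
    # visited centers toward the batch features at rate alpha (simplified update).
    diff = features - centers[labels]                  # [batch, dim]
    loss = 0.5 * np.mean(np.sum(diff ** 2, axis=1))
    for c in np.unique(labels):
        centers[c] += alpha * diff[labels == c].mean(axis=0)
    return loss

features = np.random.rand(8, 4).astype('float32')
labels = np.random.randint(0, 3, size=8)
centers = np.zeros((3, 4), dtype='float32')
print(center_loss_step(features, labels, centers))
```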
* extend the matmul op to support multi-head multiplication
With multi-head support, the multiplication of two large matrices is
split into multiplications of several (head_number) smaller matrices. For example, if
Mat A is [3, 24] and Mat B is [24, 4], multiplying A and B with head_number
set to 4 splits Mat A into 4 matrices of [3, 6] and Mat B into 4 matrices of
[6, 4]. The final result is 4 matrices of [3, 4], i.e. [3, 16]; a numpy sketch of these semantics follows below.
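A small numpy sketch of the split described above; the concatenation of the per-head results along the last axis is inferred from the stated [3, 16] output shape.

```python
import numpy as np

head_number = 4
A = np.random.rand(3, 24)                    # Mat A: [3, 24]
B = np.random.rand(24, 4)                    # Mat B: [24, 4]

A_heads = np.split(A, head_number, axis=1)   # 4 matrices of [3, 6]
B_heads = np.split(B, head_number, axis=0)   # 4 matrices of [6, 4]

# Multiply head by head, then concatenate the 4 results of [3, 4] into [3, 16].
out = np.concatenate([a @ b for a, b in zip(A_heads, B_heads)], axis=1)
assert out.shape == (3, 16)
```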
The change includes 2 things:
1. Saving the delta model and shrinking the table were controlled by the same parameter before; now delete_after_unseen_days is added to control table shrinking.
2. Values in the sparse table had no slot before; now a slot is added to the sparse table, and DownpourCtrAccessor is added to support the new meta.
test=develop
(1) support patch data (merge slots of instances with the same line id; modify dense layers whose
size changes)
(2) add the fleet load_one_table interface, supporting loading from a Paddle model and from a PSLib model (a hedged sketch follows after this list)
(3) fix a push-sparse bug that made push sparse cost more time (about 10% in my test case)
(4) when some slots are not in one of your networks (join/update, etc.), data feed, collect label info, and push/pull sparse will skip these slots instead of throwing an error.
(5) add more debug info in TrainFilesWithProfiler
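A hedged sketch of the load_one_table call mentioned in (2): the import path and the (table_id, model_path) signature are assumptions based on the PSLib fleet API, the path is a placeholder, and an already-initialized fleet is assumed.

```python
# Sketch only: import path and signature are assumed; the model directory is a placeholder.
from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet

# Assumes fleet.init(...) has already been called with a PSLib role maker.
# Loads a single table, either from a Paddle model or from a PSLib model dump.
fleet.load_one_table(0, "hdfs:/path/to/your/model")  # hypothetical table id and path
```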
The change includes 3 things:
1. Set CPU_NUM to 1 in the tests, because ParallelExecutor otherwise prints a warning that CPU_NUM is not set and defaults it to 1.
2. The old tests compared two RNNs, a hand-written simple RNN and the same RNN built with Paddle, but initialized the RNN weights with numpy's and Paddle's random generators separately. Fixed by setting explicit weight and bias values.
3. Also set the numpy random seed in the tests. Now the diff between the two RNNs can be smaller (rtol tightened from 0.1 and 0.2 to 0.01); a small sketch of this setup follows after this list.
test=develop
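A toy illustration of the approach in items 2 and 3 above (fixed numpy seed, explicit weights, tightened tolerance); the tanh step is only a stand-in for the RNN cell, not the actual PaddingRNN computation.

```python
import numpy as np

np.random.seed(2019)                                   # fixed numpy seed for the test
hidden = 16
w = np.full((hidden, hidden), 0.1, dtype='float32')    # explicit weights instead of random init
x = np.random.rand(4, hidden).astype('float32')

ref = np.tanh(x @ w)       # stand-in for the hand-written RNN step
out = np.tanh(x.dot(w))    # stand-in for the framework-built RNN step
np.testing.assert_allclose(out, ref, rtol=0.01)        # tightened from 0.1/0.2 to 0.01
```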
Test PaddingRNN on V100 GPU device.
Test configuration: large model, padding mode (the mode that uses RecurrentOp), one GPU.
GPU memory (MiB): 6414 (this PR) vs 6837 (without this PR)
Speed (steps/s): 10.28 (this PR) vs 9.89 (without this PR)
* feature/auto_growth_allocator, test=develop
* add unittest of AlignedAllocator, test=develop
* try to turn on auto_growth to test on CI, test=develop
* fix segmentation fault in mixed_vector.h, test=develop
* add unittests, test=develop
1. Since the allreduce op has 4 reduce types, we split these four reduce types into four ops
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter since the target DeviceContext is already known
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis
* fix the redundant prepare-context code problem; optimize the executor by caching create_variables
test=develop
* supports collective training in executor
* make fetch_list runnable with Variables, add more unit tests for use_program_cache (a minimal sketch follows this entry)
test=develop
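A minimal sketch of the two features in the entry above, assuming the fluid 1.x Python API (fluid.layers.data/fc and Executor.run):

```python
import numpy as np
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[1], dtype='float32')
y = fluid.layers.fc(input=x, size=1)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# fetch_list accepts Variable objects directly, and use_program_cache reuses the
# prepared context when the same program is run repeatedly.
out, = exe.run(fluid.default_main_program(),
               feed={'x': np.ones((2, 1), dtype='float32')},
               fetch_list=[y],
               use_program_cache=True)
```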
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine the strategy-building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update to support Python 3.5
* test=develop change GPU memory usage to 0.1 in tests
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dump on Python 3.5
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
remove the shard_index op and its test
change reading input from a file to reading from memory
remove origin_program in framework and add I/O in c_sync_calc_stream
* test=develop update unittest sync operator I/O
1. fix the bug that out_put_var in SaveSelectedRows could be an empty string
2. use merge_sparse_lookup_table to replace sum op for load_persistables_for_inference
3. fix the bug in _clone_var_in_block_ when the var is SELECTED_ROWS.
(1) use Channel instead of vector/BlockingQueue in Dataset, to stay consistent with the existing implementation and make the code more readable and flexible (Dataset can have a single output channel or multiple output channels). One previous out-of-memory problem was caused by not releasing memory after training.
(2) add Record because MultiSlotType costs too much memory (80B); this fixes the out-of-memory problem
(3) add Channel and Archive in paddle/fluid/framework
(4) change Dataset from shared_ptr to unique_ptr in pybind
(5) move creating/destroying readers from the trainer to the dataset
(6) move shuffle from DataFeed to Dataset; Dataset holds the memory, DataFeed only loads data and feeds it to the network
(7) fix a thread-number bug in Dataset when the filelist size is smaller than the thread number
(8) support set_queue_num in InMemoryDataset (a hedged sketch follows after this list)
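A hedged sketch of the InMemoryDataset flow touched by (5)-(8); the method names are assumed from the fluid Dataset Python API, set_queue_num in particular is the interface added here, and the slot names and file path are placeholders.

```python
import paddle.fluid as fluid

# Sketch only: method names assumed from the fluid Dataset API; names/paths are placeholders.
slots = [fluid.layers.data(name=n, shape=[1], dtype='int64', lod_level=1)
         for n in ('slot1', 'slot2')]

dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_use_var(slots)
dataset.set_thread(4)
dataset.set_queue_num(4)                       # new per-dataset channel/queue setting
dataset.set_filelist(["train_data/part-000"])  # placeholder file
dataset.load_into_memory()                     # the dataset, not the datafeed, holds memory
dataset.global_shuffle()                       # shuffle now lives in Dataset
```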
* test=develop, add add_multi_gpu_install_check
* test=develop, refine warning doc
* test=develop, refine warning doc
* test=develop, refine warning doc
* test=develop, support multi cpu
* test=develop, find right num of cuda device
* test=develop, find right num of cuda device
* test=develop, fix multigpu processing and fix type bug in dygraph
* test=develop, fix multigpu processing and fix type bug in dygraph
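A hedged sketch of the resulting check, assuming it is exposed as fluid.install_check.run_check() (the entry-point name is an assumption here):

```python
import paddle.fluid as fluid

# Assumed entry point for the install check described above; it should detect the
# available CUDA devices (falling back to CPU) and run a small program on them.
fluid.install_check.run_check()
```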