* add hard_swish activation op (new op); see the formula sketch after these notes
test=develop
* remove redundancy files
* modify document content of HardSwish OP
* add API test in test_layers.py
* add dynamic_graph for test_hard_swish
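The hard_swish op added above follows the usual hard-swish formula; below is a minimal NumPy sketch of that formula, assuming the common defaults threshold=6, scale=6, offset=3 (the op's actual defaults may differ).

```python
import numpy as np

def hard_swish(x, threshold=6.0, scale=6.0, offset=3.0):
    # hard_swish(x) = x * relu6(x + offset) / scale
    return x * np.clip(x + offset, 0.0, threshold) / scale

x = np.array([-4.0, -1.0, 0.0, 2.0, 5.0], dtype=np.float32)
print(hard_swish(x))  # piecewise-linear approximation of swish
```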
* add a place field in DataFeed to denote which place it will feed data to.
* abstract the copy process in CopyToFeedTensor function
* add UT for float32 type and for CUDAPlace
* Add call stack info during runtime and compile time
test=develop
* Rename operator_call_stack
test=develop
* Add unit test
test=develop
* follow comment
test=develop
* add train demo for imdb text classification task
* make the inference library release data_feed, dataset, dataset_factory, data_feed_factory
* add String Data Generator
* new feature of train demo: save model params
* New feature of train demo: set training config using gflags
* change code style for CI
* add readme and dataset for imdb demo trainer
* fix QueueDataset queue size: set queue size = batch size * 100, to avoid piling up too many instances in the channel when training is much slower than reading data.
* fix warpctc.dll not found issue, test=develop
* revert the linux platform change, test=develop
* delete warpctc_lib_path.h.in, test=develop
* add SetPySitePackagePath function
* fix warpctc.dylib not found issue on Mac, test=develop
* improve the paddle lib path setting logic, test=develop
* fix mac ci issue caused by test_warpctc_op unittest, test=develop
* tweak code, test=develop
* open gc by default, test=develop
* fix test_train_recognize_digits and disable gc when ngraph is enabled, test=develop
* fix conditional_block op eager deletion bug, test=develop
* add some comments to reviewers, test=develop
* Fix Mask rcnn predictor
1. refine memory optim algorithm to support the model with the block op.
2. output diff : modify the affine channel fuse
3. add condition_block_infer op
add interface for setting trt calib table dir
test=develop
* add the missing files.
test=develop
* add trt fp16 support
test=develop
* fix trt fp16 ce error
test=develop
* add a vlog if the user uses trt4 and specifies fp16.
test=develop
* support filelist size < trainer num
* pull dense params on stop, to make sure local dense params are the same as on the pserver, so that saving the paddle model saves the same dense model as the pserver
* enable QueueDataset to train on the same filelist several times
* Fix memory leak in test
test=develop
* Fix memory leak in test
test=develop
* Fix memory leak in test
test=develop
* Pull vars out of the loops
test=develop
* test=develop
Add the unique_with_counts op, which computes the unique values of the input data and outputs the corresponding indices and counts (see the sketch after these notes).
* test=develop
Check the input and dtype in the op of unique_with_counts
* test=develop
test=document_preview
update the API.spec for `unique_with_counts` and, at the same time, optimize the python api of the `unique_with_counts` op
* test=develop
test=document_preview
Fix some python api problems in the `unique_with_counts` op, and change its error message.
* Fix some API problem in the op of `unique_with_counts`
test=develop
test=document_preview
* test=develop
test=document_preview
Fix the api sample of op `unique_with_counts`, and update api.spec
test=develop
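For reference, a NumPy sketch of the unique_with_counts semantics described above; the Paddle op's output ordering and dtypes are assumptions here and may differ.

```python
import numpy as np

x = np.array([2, 3, 3, 1, 5, 3], dtype=np.int64)
# unique values, index of each input element into the unique array,
# and the occurrence count of each unique value
out, index, count = np.unique(x, return_inverse=True, return_counts=True)
print(out)    # [1 2 3 5]
print(index)  # [1 2 2 0 3 2]
print(count)  # [1 1 3 1]
```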
- Extracted key generation from FWD and GRAD into a separate function
test=develop
- Compilation fix
test=develop
- another compilation fix
test=develop
* fix security issue, test=develop
* bug fix, test=develop
* throw an exception when a PaddleBuf with a null data pointer but non-zero length is passed, test=develop
* Fix Mask rcnn predictor
1. refine memory optim algorithm to support the model with the block op.
2. output diff : modify the affine channel fuse
3. add condition_block_infer op
add interface for setting trt calib table dir
test=develop
* add the missing files.
test=develop
* add trt fp16 support
test=develop
* support center loss
* change the tensor copy api to the high-level TensorCopy api
* test=develop rewrite the center_loss cuda_kernel to make it faster,
add documentation for the center loss api, and update the test function (see the sketch after these notes)
* test=document_preview test=develop
update the documentation of center loss
* test=document_preview test=develop
modify API.spec, modify test code, remove unused const_cast
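The center loss referenced above follows the standard formulation (mean squared distance between each feature and its class center); a minimal NumPy sketch of the forward loss only, omitting the alpha-scaled center update the op also performs.

```python
import numpy as np

def center_loss(features, labels, centers):
    # L = 1/2 * mean_i ||x_i - c_{y_i}||^2
    diff = features - centers[labels]
    return 0.5 * np.mean(np.sum(diff * diff, axis=1))

feats = np.random.rand(4, 8).astype(np.float32)  # batch of 4, feature dim 8
labels = np.array([0, 2, 1, 2])                  # class id per sample
centers = np.zeros((3, 8), dtype=np.float32)     # one center per class
print(center_loss(feats, labels, centers))
```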
* change INT8 to a template parameter so that checking dst_dt with if-else can be removed. CI will be enabled after addressing reviews
* reverse user_residual_memory_p and user_bias_memory_p declaration scope
test=develop
* extend matmul op to support multi-head multiplication
With multi-head support, the multiplication of two big matrices is
split into multiplications of several (head_number) small matrices. E.g. if
Mat A is [3, 24] and Mat B is [24, 4], when multiplying A and B with head_number
set to 4, Mat A is split into 4 matrices of [3, 6] and Mat B into 4 matrices of
[6, 4]. The final result is 4 matrices of [3, 4], concatenated into [3, 16] (see the sketch below).
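A NumPy sketch of the head splitting described above; how the real op lays out and concatenates the per-head results is an assumption here.

```python
import numpy as np

head_number = 4
A = np.random.rand(3, 24)
B = np.random.rand(24, 4)

# split A along columns and B along rows into head_number pieces,
# multiply piecewise, then concatenate the [3, 4] results into [3, 16]
A_heads = np.split(A, head_number, axis=1)   # 4 matrices of [3, 6]
B_heads = np.split(B, head_number, axis=0)   # 4 matrices of [6, 4]
C = np.concatenate([a @ b for a, b in zip(A_heads, B_heads)], axis=1)
print(C.shape)  # (3, 16)
```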
* update paddle-trt for:
1. fix bug: when batch > 2, core dump in the split plugin.
2. add leaky_relu trt5.0 support (yolov3 from 65ms to 42ms.)
3. add new attr to dropout.
4. shuffle channel, swish, relu6 support
test=develop
* 1. fix ci
test=develop
The change includes 2 things:
1. Saving the delta model and shrinking the table were controlled by the same parameter before; now delete_after_unseen_days is added to control table shrinking.
2. Values in the sparse table had no slot before; now a slot is added to the sparse table, and DownpourCtrAccessor is added to support the new meta.
test=develop
(1) support patch data (merge slots of instances with the same line id, modify the dense layer whose size changes)
(2) add fleet load_one_table interface, supporting loading from a paddle model and from a pslib model (see the sketch after this list)
(3) fix a push sparse bug which caused push sparse to cost more time (about 10% in my test case)
(4) when some slots are not in one of your networks (join/update, etc.), data feed, collecting label info, and push/pull sparse will skip these slots instead of throwing an error
(5) add more debug info in TrainFilesWithProfiler
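A hypothetical usage sketch of the load_one_table interface added in (2); the module path and argument names below are assumptions, not confirmed signatures.

```python
# assumed import path for the pslib fleet API
from paddle.fluid.incubate.fleet.parameter_server.pslib import fleet

# load a single sparse table (table id 0 here) from a previously saved
# paddle/pslib model directory instead of reloading the whole model
fleet.load_one_table(0, "hdfs:/path/to/saved_model")
```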
Test PaddingRNN on V100 GPU device.
Test configuration: large model, padding mode (which is the mode using recurrentOp), one GPU.
GPU memory (MiB): 6414 (this PR) vs 6837 (without this PR)
Speed (steps/s): 10.28 (this PR) vs 9.89 (without this PR)
optimize the error reporting information of CUDA-related APIs
* feature/auto_growth_allocator, test=develop
* add unittest of AlignedAllocator, test=develop
* try to turn on auto_growth to test on CI, test=develop
* fix segmentation fault in mixed_vector.h, test=develop
* add unittests, test=develop
* Add GPU implementation for `prelu` backward pass
test=develop
* Fix logic error in `prelu` GPU backward and simplify a bit
test=develop
* Fix `prelu` backward CUDA implementation
test=develop
The CPU version was not actually used, so the test passed (gradient sketch below).
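For reference, a NumPy sketch of the prelu gradient, assuming a single shared alpha; the channel-wise and element-wise variants reduce dalpha over different axes.

```python
import numpy as np

def prelu_backward(x, alpha, dy):
    # forward: y = x where x > 0, alpha * x otherwise
    dx = np.where(x > 0, dy, alpha * dy)            # gradient w.r.t. input
    dalpha = np.sum(np.where(x > 0, 0.0, dy * x))   # gradient w.r.t. shared alpha
    return dx, dalpha

x = np.array([-2.0, -0.5, 1.0, 3.0])
dy = np.ones_like(x)
print(prelu_backward(x, 0.25, dy))
```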
* update anakin-engine interfaces for content-dnn
test=develop
* support only-gpu mode of Anakin
modify eltwise parse
test=develop
* modification for thread-safe
test=develop
* Integrated template instance
test=develop
* increase template parameters
test=develop
* support MLU predictor
test=develop
* update anakin cmake files
test=develop
* update TargetWrapper::set_device
* update the initialization of anakin subgraph
test=develop
* use the default constructor of base class
test=develop
* load model from buffer with length
test=develop
* modify the access level of class
test=develop
* support anakin for bitmain arch
test=develop
* remove files
* checkout cmakelists
test=develop
* modify interfaces
test=develop
* add cmake dependencies
test=develop
* enforce the outputs of net
test=develop
* do not use the transfer scope cache in the cpu case
test=develop
* adjust variable name and add comments
test=develop
* use correct format for class member in operator.h
* use correct format for class member in operator.cc
test=develop
* Fix Mask rcnn predictor
1. refine memory optim algorithm to support the model with the block op.
2. output diff : modify the affine channel fuse
3. add condition_block_infer op
add interface for setting trt calib table dir
test=develop
* add the missing files.
test=develop
* update anakin-engine interfaces for content-dnn
test=develop
* support only-gpu mode of Anakin
modify eltwise parse
test=develop
* modification for thread-safe
test=develop
* Integrated template instance
test=develop
* increase template parameters
test=develop
* support MLU predictor
test=develop
* update anakin cmake files
test=develop
* update TargetWrapper::set_device
* update the initialization of anakin subgraph
test=develop
* use the default constructor of base class
test=develop
* load model from buffer with length
test=develop
* modify the access level of class
test=develop
* support anakin for bitmain arch
test=develop
* remove files
* checkout cmakelists
test=develop
* rename mkldnn set/get_cur_thread_id() to set/get_cur_mkldnn_session_id()
test=develop
* update session id definition and adjust logic for default behavior
test=develop
* reset logic in mkldnn reuse, as most cases work with the default.
test=develop
1. Since the allreduce op has 4 reduce types, we split these four reduce types into four ops (see the sketch after this list)
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter since we already know the target DeviceContext
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis
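A NumPy sketch of allreduce semantics per reduce type, assuming the four types are the standard sum, prod, max and min (the text does not name them).

```python
import numpy as np

# each rank contributes one tensor; allreduce combines them with a reduce op
# and gives every rank the same result
rank_tensors = [np.array([1.0, 4.0]), np.array([2.0, 3.0]), np.array([5.0, 0.5])]
stacked = np.stack(rank_tensors)
for name, op in [("sum", np.sum), ("prod", np.prod), ("max", np.max), ("min", np.min)]:
    print(name, op(stacked, axis=0))
```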
* Fix a bug in the quantize kernel which causes a crash in the vgg16/19 model
test=develop
* refine the code to reduce verbosity; test=develop
* remove useless code; test=develop
1. some key generation methods are not aligned with PR#17965
2. enlarge the ptr lifetime to avoid memory release if SetBlob fails;
otherwise it will core dump.
test=develop
* fix prepare context redundant code problem, optimize executor by caching variable creation
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unittests for use_program_cache
test=develop
* fix comment
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine the strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* fix prepare context redundant code problem, optimize executor by caching variable creation
test=develop
* supports collective training in executor
* make fetch_list runnable with variables, add more unittests for use_program_cache
test=develop
* use unique name for nccl_id
* supports output to stream in program_to_code
* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
* set op role in collective training
* add collective op role
* fix comment
test=develop
* remove orig file
* add build optimizer by strategy
* add collective strategy
* refine collective strategy
* add multi-process role maker
* refine the strategy building factory so that we can easily plug in more strategies
* scale loss grad in collective sgd transpiler
* add support for distributed fc
* code format
* revert some features for dist fc
* add support for distributed fc training
* test=develop
add collective op unittest standard
* test=develop
remove the test_collective directory
* test=develop
remove the test_collective directory
* remove slicegather test
* code format for reducescatter
* update attr of shard_index_op
* Modify macro nccl_helper
* remove test without distribute
* macro collective_helper
* macro update
* test=develop
update to support python3.5
* test=develop change gpu memory use to 0.1 when testing
* test=develop
update ut equal func
* test=develop
set flags to 1.5
* test=develop fix pickle dump on py35
* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
change reading input from a file to reading from memory
remove origin_program in framework and add i/o in c_sync_calc_stream
* test=develop update unittest sync operator I/O