Paddle

Commit Graph

Author	SHA1	Message	Date
liu zhengxi	ae2be49f40	Add cublas_handle() to expose cublas_handle to ops (#31157 ) * add get_cublas_handle() api * update format * add unittests * alter function name	4 years ago
wuhuanzhou	9b3c80c8ab	update eigen version on Windows (#30573 ) * update eigen version on Windows, test=develop * add /bigobj for cl, test=develop	4 years ago
Qi Li	93c1d9e761	[ROCM] update fluid platform for rocm39 (part3), test=develop (#30913 )	4 years ago
wanghuancoder	35c5b23f68	use iwyu clean include second time, test=develop (#30829 ) * use iwyu clean include second time, test=develop	4 years ago
WangXi	b1026f64af	【kunlun】dygraph supports multi xpu card training (#30671 )	4 years ago
QingshuChen	c35a9880f9	fix malloc L3 failed bug for kunlun (#30745 ) * fix malloc L3 failed bug for kunlun * minor	4 years ago
Jacek Czaja	173660be7b	[oneDNN] Cache oneDNN stream not to recreate in each oneDNN op (#30358 )	4 years ago
liuyuhui	843dc3cdbd	[Kunlun]PR3: add xpu executor, multi xpu card train function optimization (#30317 )	4 years ago
QingshuChen	8489d4f76f	optimize batch_norm & pool op for kunlun (#30490 )	4 years ago
QingshuChen	2c1bba02e4	optimize memcpy perf for kunlun (#30291 ) * optimize memcpy perf for kunlun * remove useless unitest for kunlun mean * minor	4 years ago
AshburnLee	924aac2216	Add tf32 switch for cuDNN (#29192 )	4 years ago
Jacek Czaja	c9e874fc8e	[oneDNN] Unit test for checking oneDNN caching (#29606 )	4 years ago
liuyuhui	f13c3a9cd7	[Kunlun] PR1:Support one Kunlun card training in parallel executor (#29337 )	4 years ago
AshburnLee	efea540ca9	Add tf32 support for A100 tensor core acceleration for cuBLAS (#28732 )	4 years ago
arlesniak	62d4483649	Added verbose oneDNN lib version (#29378 )	4 years ago
Aurelius84	7ae3cb554a	Polish CUDA Information stdout (#29109 )	4 years ago
wawltor	b2c8a00745	remove eigen threadpool for the speed up	4 years ago
123malin	cc780b1977	test=develop, optimize geo communicator (#26857 ) * test=develop, optimize geo communicator	4 years ago
Jack Zhou	63203c4abc	enhance reduce op which can reduce tensor with arbitrary rank	4 years ago
Adam	f3909020de	Add mechanism for blocking oneDNN cache clearing (#26502 ) * Add mechanism for blocking oneDNN cache clearing * Review changes and Add thread guards	5 years ago
QingshuChen	138ecf24aa	support Baidu Kunlun AI Accelerator (#25959 ) * support Baidu AI Accelerator * test=kunlun * minor * test=kunlun * support xpu op in separate file * test=kunlun * update XPU error message and remove duplicated code * test=kunlun * minor * test=kunlun * minor * test=kunlun	5 years ago
GaoWei8	c10dcff12d	refine PADDLE_ENFORCE (#25456 ) * Refine PADDLE_ENFORCE in paddle/fluid/platform test=develop	5 years ago
GaoWei8	ea7e532598	Refine PADDLE_ENFORCE (#25369 ) * refine PADDLE_ENFORCE test=develop	5 years ago
Chen Weihang	d1062d5278	Replace all errors thrown by LOG(FATAL) with PADDLE_THROW (#24759 ) * remove REPLACE_ENFORCE_GLOG compile option & add ci rule prohibit LOG(FATAL) using, test=develop * remove ci test case, test=develop * replace all LOG(FATAL) & polish message, test=develop * fix typo, test=develop * polish error info detail, test=develop	5 years ago
pawelpiotrowicz	db2b6b6568	Hide globals & redesign restore PR (#24279 ) test=develop	5 years ago
Chen Weihang	aa0f254fbe	Add macro BOOST_GET to enrich the error information of boost :: get (#24175 ) * add new macro BOOST_GET_SAFELY & unittests, test=develop * add different macro type, test=develop * fix get macro type in executor, test=develop * four macro part change backup * using one macro for all case, test=develop * revert attribute change, test=develop * change to three func to solve gcc4.8 bug, test=develop * polish some details, test=develop	5 years ago
Sylwester Fraczek	e1a7a88057	added reshape transpose matmul fuse pass (#23754 )	5 years ago
Guo Sheng	a8c0fb4e86	Add cholesky_op (#23543 ) * Add cholesky_op forward part. test=develop * Complete cholesky_op forward part. test=develop * Add cholesky_op backward part. test=develop * Complete cholesky_op backward part. test=develop * Refine cholesky_op error check and docs. test=develop * Add grad_check unit test for cholesky_op. test=develop * Fix sample code in cholesky doc. test=develop * Refine some error messages of cholesky_op. test=develop * Refine some error messages of cholesky_op. test=develop * Remove unused input in cholesky_grad. test=develop * Remove unused input in cholesky_grad. test=develop * Fix stream for cusolverDnSetStream. test=develop * Update PADDLE_ENFORCE_CUDA_SUCCESS from cholesky_op to adapt to latest code. test=develop * Add CUSOLVER ERROR in enforce.h test=develop * Fix the missing return value in cholesky. test=develop	5 years ago
石晓伟	34d7d6aef0	declare the stream::Priority as enum class, test=develop (#24013 )	5 years ago
Zhang Ting	b89dd86fb6	Update eigen (#23203 ) * update eigen, test=develop * remove patches, test=develop * add definition of -fabi-version, test=develop * add patch for TensorBlock.h, test=develop * test windows, test=develop * only update eigen for Linux, test=develop * add code comments, test=develop	5 years ago
石晓伟	2d01cc85c4	DeviceContext Split, test=develop (#23737 ) * supports thread-binding stream, test=develop * avoid using thread_local variables in dtor, test=develop * modify the stream priority enum, test=develop	5 years ago
石晓伟	5c59d2139e	reverts the commit 23177, test=develop (#23363 )	5 years ago
Yi Liu	0471476a18	fix nccl comm double free bug (#23344 ) As nccl comm is not created by CUDADeviceContext, it should be destroyed by the creator as the best practice of RAII.	5 years ago
石晓伟	75ebb48a91	supports thread-binding stream, test=develop (#23177 )	5 years ago
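Wilber	7bc4b09500	add WITH_NCCL option for cmake. (#22384 ) Added WITH_NCCL to the cmake options to explicitly specify whether the NCCL-related code is compiled. WITH_NCCL is ON by default, but is turned OFF if WITH_GPU is OFF. Added the PADDLE_WITH_NCCL definition. Single-machine, single-card setups can disable NCCL compilation; multi-card setups need NCCL enabled by default, and if NCCL is disabled only a single card can be used. Co-authored-by: 石晓伟 <39303645+Shixiaowei02@users.noreply.github.com>	5 years ago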
zhaoyuchen2018	3d4f2aa689	Refine stack op to improve xlnet performance, test=develop (#22142 ) stack's wait cost a lot of cpu time, use cuda kernel to do memory copy will reduce cpu time. Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>	5 years ago
Adam	e81f0228df	MKL-DNN 1.0 Update (#20162 ) * MKLDNN v1.0 rebase to Paddle 1.6 test=develop * Add hacky paddle::string::to_string() implementation * vectorize<int64-t>() -> vectorize() cleanup test=develop * PADDLE_ENFORCE and void_cast fixes test=develop * Rebase changes test=develop * Cosmetics test=develop * Delete MKL from mkldnn.cmake test=develop * CMake debug commands test=develop * Delete MKLDNN_VERBOSE and rebase fixes test=develop * Rebase fixes test=develop * Temporarily disable int8 resnet101 vgg16 and vgg19 tests test=develop * Add libmkldnn.so.1 to python setup test=develop * Add libmkldnn.so.1 to inference_lib cmake after rebase test=develop * Post rebase fixes + FC int8 changes test=develop * Fix LRN NHWC test=develop * Fix NHWC conv3d test=develop * Windows build fix + next conv3d fix test=develop * Fix conv2d on AVX2 machines test=develop	5 years ago
Zeng Jinle	97e76cb96d	refine dev_ctx.Wait() exception throw, test=develop (#21600 )	5 years ago
Jacek Czaja	cd43c4440e	[MKL-DNN] LRN and Pool2d (FWD) NHWC support (#21375 )	5 years ago
liuwei1031	d8b6cf2bcd	fix sporadically hang issue on windows(#21201 ) cudaStreamSynchronize randomly hang when used in multi-thread environment, replace it with cudaStreamQuery API on windows	5 years ago
zhaoyuchen2018	b93870e696	Improve topk performance. (#21087 ) * Improve topk performance. give 200000 data to compute topk, before opt: cost 1s after opt: cost 0.0028s. * Refine return value. * Add cuda util funtions. * Fix ComputeBlockSize bug & refine comments. Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>	5 years ago
Zeng Jinle	37f76407b0	fix cuda dev_ctx allocator cmake deps, test=develop (#19953 )	5 years ago
Zeng Jinle	c7f36e7c00	Add lock to cudnn handle calls (#19845 ) * refine reallocate of workspace size, test=develop * add lock to cudnn handle calls, test=develop	5 years ago
Huihuang Zheng	12542320c5	Replace TemporaryAllocator by CUDADeviceContextAllocator (#18989 ) TemporaryAllocator is a singleton used for allocating memory for Cudnn. Since it is a singleton, we can delete it for better performance in memory. We replace TemporaryAllocator by CUDADeviceContextAllocator and CUDADeviceContextAllocation, which uses stream callback to delete the memory allocated for the stream to avoid singleton. Also added data_feed_proto to operator to fix CI in CPU compilation	6 years ago
gongweibao	29d8781240	Polish fleet API to support cuda collective mode and nccl2 mode. (#18966 ) Polish fleet API to support cuda collective mode and nccl2 mode	6 years ago
Tao Luo	076f833110	add config.SetMkldnnCacheCapacity api for mkldnn cache clear strategy (#18580 ) * add config.SetMkldnnCacheCapacity api for mkldnn cache clear strategy test=develop * enhance MkldnnPostReset test=develop * add comments for mkldnn_cache_capacity field test=develop	6 years ago
Tao Luo	fe32879d2a	add mkldnn shapeblob cache clear strategy (#18513 ) * add mkldnn shapeblob cache clear strategy test=develop * refine with comments test=develop * make cache clear strategy more safey test=develop * add lock for GetShapeBlobSize test=develop	6 years ago
Tao Luo	3f3112ceb0	add shape_blob for cache mkldnn primitive (#18454 ) test=develop	6 years ago
Leo Zhao	8f5fffca0a	rename mkldnn set/get_cur_thread_id() to set/get_cur_mkldnn_session_id() (#18453 ) * rename mkldnn set/get_cur_thread_id() to set/get_cur_mkldnn_session_id() test=develop * update session id definition and adjust logic for default behavior test=develop * reset logic in mkldnn reuse as most of cases work in default. test=develop	6 years ago
Michał Gallus	8409693272	Reset DeviceContext after quantization warmup (#18182 ) test=develop	6 years ago

133 Commits (32211fe9c4c22168dfb73f19763b17ac9191341a)