Fix NCCLBcast hang up bug in Parallel Executor (#11377)
* 1. Create the buddy allocator in each place before calling ncclBcast on the variables
  2. Check the memory usage of all GPUs rather than only the first one
* 1. Make NCCLGroupGuard guard only the ncclBcast part, so that ncclGroupEnd does not block an exception from being thrown
  2. Add a NOTE on the usage of NCCLGroupGuard
* Remove the memory usage check of the GPUs
* Fix code style
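The scoping change described above can be illustrated with a minimal sketch. It does not use NCCL or the actual Paddle code: `FakeGroupStart`, `FakeGroupEnd`, `GroupGuard`, `PrepareBuffers`, and `BroadcastParams` are hypothetical stand-ins that only record events, assuming an RAII guard that opens a group in its constructor and closes it in its destructor. The point it tries to show is that steps which may throw (such as allocation) run before the guard is created, so a failure never has to unwind through the group-end call.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Event log used only to make the ordering observable in this sketch.
static std::vector<std::string> g_events;

// Hypothetical stand-ins for ncclGroupStart / ncclGroupEnd.
void FakeGroupStart() { g_events.push_back("group_start"); }
void FakeGroupEnd() { g_events.push_back("group_end"); }

// RAII guard in the spirit of NCCLGroupGuard (assumed shape, not Paddle's code).
class GroupGuard {
 public:
  GroupGuard() { FakeGroupStart(); }
  ~GroupGuard() { FakeGroupEnd(); }
  GroupGuard(const GroupGuard&) = delete;
  GroupGuard& operator=(const GroupGuard&) = delete;
};

// Hypothetical per-device preparation step that may throw,
// standing in for creating the allocator on each place.
void PrepareBuffers(int num_devices, bool fail) {
  for (int dev = 0; dev < num_devices; ++dev) {
    if (fail && dev == num_devices - 1) {
      throw std::runtime_error("allocation failed");
    }
    g_events.push_back("alloc_dev" + std::to_string(dev));
  }
}

void BroadcastParams(int num_devices, bool fail_alloc) {
  // Everything that can throw happens before the group is opened.
  PrepareBuffers(num_devices, fail_alloc);
  {
    // The guard covers only the broadcast calls themselves.
    GroupGuard guard;
    for (int dev = 0; dev < num_devices; ++dev) {
      g_events.push_back("bcast_dev" + std::to_string(dev));
    }
  }  // Group is closed here, after all broadcasts were issued.
}
```

Calling `BroadcastParams(2, true)` in this sketch throws from the preparation step without ever opening a group, whereas a guard declared at the top of the function would force the group-end call to run during stack unwinding.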
parent cbaa24f597
commit 046bb5c8cb