Retry CUDA Initialization to Fix Random Failure, test=develop (#28323)
This PR is a follow-up to #28213. In that PR we tried to decrease GPU usage, but the CI still failed randomly, so I added retry logic to the initialization of NCCL and cuSOLVER. If initialization fails, it is retried instead of failing the run immediately, which avoids the random failure.
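The retry idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RetryInit`, `FlakyInit`, the attempt count, and the delay are all hypothetical names and values chosen for the example, and the flaky initializer merely stands in for a real NCCL or cuSOLVER initialization call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical helper (not a Paddle API): calls `init` until it reports
// success or `max_attempts` is exhausted. Returns the number of attempts
// used and throws std::runtime_error if every attempt failed.
inline int RetryInit(const std::function<bool()>& init, int max_attempts,
                     std::chrono::milliseconds delay) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (init()) {
      return attempt;  // initialization succeeded on this attempt
    }
    if (attempt < max_attempts) {
      // Give a transient problem (e.g. a busy GPU) time to clear.
      std::this_thread::sleep_for(delay);
    }
  }
  throw std::runtime_error("initialization failed after all retries");
}

// Stand-in for a flaky initializer: fails `failures_left` times, then
// succeeds. A real caller would wrap the library's init call here and
// return true only when that call reports success.
struct FlakyInit {
  int failures_left;
  bool operator()() {
    if (failures_left > 0) {
      --failures_left;
      return false;
    }
    return true;
  }
};
```

For example, `RetryInit(FlakyInit{2}, 5, std::chrono::milliseconds(100))` would fail twice and succeed on the third attempt. Bounding the number of attempts keeps a persistent failure visible as an error rather than hanging the CI job.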
parent 5262b02585
commit acc11c2a62