nvidia gpu-operator image onboarding (internal request)
(the internal request ticket link seems broken, btw)
Greetings! PB has a customer using GPUs for cluster workloads. To support this, we’ve stood up GPU nodes and deployed the NVIDIA gpu-operator to the cluster, and we’d now like to get the requisite images onboarded into IB. I’ve done just about the most barebones deployment of the operator possible; image details below:
nvcr.io/nvidia/gpu-operator:v24.3.0: this is the operator itself. Upstream Dockerfile here
nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0
I think this is basically a validation init container; it seems to be used in almost every pod the operator spawns. Dockerfile here
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
GPU metrics exporter. Upstream Dockerfile here
nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubi8
The NVIDIA device plugin, required to actually schedule GPU time. Dockerfile here
registry.k8s.io/nfd/node-feature-discovery:v0.15.4
This is just the generic Kubernetes NFD image. I didn’t find an existing one in IB, but if there is one, feel free to point me in its direction. Dockerfile here
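For context, once the images land in IB, repointing my barebones install at them should just be a matter of Helm value overrides. The sketch below is illustrative only: the internal registry path is a placeholder I made up, and the exact value keys should be double-checked against the chart’s values.yaml for v24.3.0.

```yaml
# Sketch of Helm overrides for pulling the gpu-operator images from an
# internal registry instead of nvcr.io / registry.k8s.io.
# "registry.example.internal/ironbank" is a placeholder, not a real path.
operator:
  repository: registry.example.internal/ironbank/nvidia
validator:
  repository: registry.example.internal/ironbank/nvidia/cloud-native
dcgmExporter:
  repository: registry.example.internal/ironbank/nvidia/k8s
devicePlugin:
  repository: registry.example.internal/ironbank/nvidia
node-feature-discovery:
  image:
    repository: registry.example.internal/ironbank/nfd/node-feature-discovery
```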
And then, of course, all the NVIDIA containers appear to be built on top of the base CUDA image, which, as far as I can tell, is built here: https://gitlab.com/nvidia/container-images/cuda/-/tree/master?ref_type=heads
And there IS an existing cuda image in IB, but it’s unclear to me whether it’s built from the same underlying code as the above.
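One way to sanity-check whether the IB cuda image tracks the same upstream would be to compare tags and manifest digests from both registries (e.g. via `skopeo inspect`). A tiny helper like the following (illustrative only, names are mine, not from this ticket) can split the image references apart so they can be lined up side by side:

```python
# Hypothetical helper: split a container image reference into its parts so
# tags/digests from two registries can be compared side by side.

def parse_image_ref(ref: str) -> dict:
    """Split 'registry/repo:tag' into components.

    The first path component is treated as a registry only if it looks
    like a hostname (contains '.' or ':', or is 'localhost'), matching
    the convention Docker/containerd use.
    """
    name, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        # No tag present, or the ':' we found was a registry port.
        name, tag = ref, "latest"
    first, _, rest = name.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        registry, repo = first, rest
    else:
        registry, repo = "docker.io", name
    return {"registry": registry, "repository": repo, "tag": tag}
```

For example, `parse_image_ref("nvcr.io/nvidia/gpu-operator:v24.3.0")` yields registry `nvcr.io`, repository `nvidia/gpu-operator`, tag `v24.3.0`.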
Feel free to hit me up on Mattermost with additional questions; I’m happy to help out where I can. Thanks!