nvidia gpu-operator image onboarding (internal request)
(the internal request ticket link seems broken, btw)
Greetings! PB has a customer using GPUs for cluster workloads. To support this, we’ve stood up GPU nodes and deployed the NVIDIA gpu-operator to the cluster, and we’d now like to get the requisite images onboarded into IB. I’ve done just about the most barebones deployment of the operator possible; image details below:
nvcr.io/nvidia/gpu-operator:v24.3.0: this is the operator itself. Upstream Dockerfile here
nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.3.0
I think this is basically a validation init container; it seems to be used in almost every pod the operator spawns. Dockerfile here
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
GPU metrics exporter. Upstream Dockerfile here
nvcr.io/nvidia/k8s-device-plugin:v0.15.0-ubi8
The NVIDIA device plugin, required to actually schedule GPU time. Dockerfile here
registry.k8s.io/nfd/node-feature-discovery:v0.15.4
This is just the generic Kubernetes NFD image. I didn’t find an existing one in IB, but if there is one, feel free to point me in its direction. Dockerfile here
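For context, once the images land in IB, repointing my barebones install at them should just be a matter of Helm value overrides. The sketch below is illustrative only: the internal registry path is a placeholder I made up, and the exact value keys should be double-checked against the chart’s values.yaml for v24.3.0.

```yaml
# Sketch of Helm overrides for pulling the gpu-operator images from an
# internal registry instead of nvcr.io / registry.k8s.io.
# "registry.example.internal/ironbank" is a placeholder, not a real path.
operator:
  repository: registry.example.internal/ironbank/nvidia
validator:
  repository: registry.example.internal/ironbank/nvidia/cloud-native
dcgmExporter:
  repository: registry.example.internal/ironbank/nvidia/k8s
devicePlugin:
  repository: registry.example.internal/ironbank/nvidia
node-feature-discovery:
  image:
    repository: registry.example.internal/ironbank/nfd/node-feature-discovery
```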
And then, of course, all the NVIDIA containers appear to be built on top of the base CUDA image, which, as far as I can tell, is built here: https://gitlab.com/nvidia/container-images/cuda/-/tree/master?ref_type=heads
And there IS an existing cuda image in IB, but it’s unclear to me whether it’s built from the same underlying code as the above.
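One way to sanity-check whether the IB cuda image tracks the same upstream would be to compare tags and manifest digests from both registries (e.g. via `skopeo inspect`). A tiny helper like the following (illustrative only, names are mine, not from this ticket) can split the image references apart so they can be lined up side by side:

```python
# Hypothetical helper: split a container image reference into its parts so
# tags/digests from two registries can be compared side by side.

def parse_image_ref(ref: str) -> dict:
    """Split 'registry/repo:tag' into components.

    The first path component is treated as a registry only if it looks
    like a hostname (contains '.' or ':', or is 'localhost'), matching
    the convention Docker/containerd use.
    """
    name, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        # No tag present, or the ':' we found was a registry port.
        name, tag = ref, "latest"
    first, _, rest = name.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        registry, repo = first, rest
    else:
        registry, repo = "docker.io", name
    return {"registry": registry, "repository": repo, "tag": tag}
```

For example, `parse_image_ref("nvcr.io/nvidia/gpu-operator:v24.3.0")` yields registry `nvcr.io`, repository `nvidia/gpu-operator`, tag `v24.3.0`.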
Feel free to hit me up on Mattermost with additional questions; I’m happy to help out where I can. Thanks!