bug: nvidia gpu operator components require root permissions
Summary
We are running the NVIDIA GPU Operator on bare metal in an RKE2 cluster. The operator and all of its Registry1-flavored components fail with permissions errors in each component's initContainers.
The cause is that the Docker builds used by these repositories downgrade the container user to nvidia,
disregarding the upstream documentation (indicated here) on the host-level access required to run these components.
Steps to reproduce
The example setup repo for our RKE2 cluster is here: https://github.com/justinthelaw/uds-rke2/tree/22-nvidia-gpu-operator-optional-bootstrapping-package. See the packages/nvidia-gpu-operator
directory for details on a successful deployment.
This assumes that node-feature-discovery is already installed on the node.
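To verify NFD is present before installing (a sketch; the label selector assumes the standard node-feature-discovery Helm chart and may differ in your deployment):

```bash
# Check that the node-feature-discovery master and workers are running.
# The label below is the one applied by the upstream NFD Helm chart.
kubectl get pods -A -l app.kubernetes.io/name=node-feature-discovery
```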
Install the NVIDIA GPU Operator into an existing cluster with a values file similar to this:
# Default values for gpu-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
platform:
openshift: false
nfd:
# usually enabled by default, but choose to use external NFD from IronBank
enabled: false
nodefeaturerules: false
psa:
enabled: false
cdi:
enabled: false
default: false
sandboxWorkloads:
enabled: false
defaultWorkload: "container"
hostPaths:
# rootFS represents the path to the root filesystem of the host.
# This is used by components that need to interact with the host filesystem
# and as such this must be a chroot-able filesystem.
# Examples include the MIG Manager and Toolkit Container which may need to
# stop, start, or restart systemd services
rootFS: "/"
# driverInstallDir represents the root at which driver files including libraries,
# config files, and executables can be found.
driverInstallDir: "/run/nvidia/driver"
daemonsets:
labels: {}
annotations: {}
priorityClassName: system-node-critical
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
# configuration for controlling update strategy("OnDelete" or "RollingUpdate") of GPU Operands
# note that driver Daemonset is always set with OnDelete to avoid unintended disruptions
updateStrategy: "RollingUpdate"
# configuration for controlling rolling update of GPU Operands
rollingUpdate:
# maximum number of nodes to simultaneously apply pod updates on.
# can be specified either as number or percentage of nodes. Default 1.
maxUnavailable: "1"
validator:
repository: registry1.dso.mil/ironbank/opensource/nvidia
image: gpu-operator-validator
version: v24.3.0
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
args: []
resources: {}
plugin:
env:
- name: WITH_WORKLOAD
value: "false"
driver:
# RKE2-specific configurations
env:
- name: DISABLE_DEV_CHAR_SYMLINK_CREATION
value: "true"
- name: NVIDIA_VISIBLE_DEVICES
value: all
# Default value of "all" causes the "display" capability to also be considered;
# however, not all hosts have or allow that capability, causing the daemonset to fail
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
operator:
repository: registry1.dso.mil/ironbank/opensource/nvidia
image: gpu-operator
version: v24.3.0
imagePullPolicy: IfNotPresent
imagePullSecrets: []
priorityClassName: system-node-critical
# Explicitly set runtime to containerd, not the default of `docker`
defaultRuntime: containerd
runtimeClass: nvidia
use_ocp_driver_toolkit: false
# cleanup CRD on chart un-install
cleanupCRD: false
# upgrade CRD on chart upgrade, requires --disable-openapi-validation flag
# to be passed during helm upgrade.
upgradeCRD: false
initContainer:
image: cuda
repository: registry1.dso.mil/ironbank/opensource/nvidia
version: 12.4
imagePullPolicy: IfNotPresent
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Equal"
value: ""
effect: "NoSchedule"
- key: "node-role.kubernetes.io/control-plane"
operator: "Equal"
value: ""
effect: "NoSchedule"
annotations:
openshift.io/scc: restricted-readonly
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: "node-role.kubernetes.io/master"
operator: In
values: [""]
- weight: 1
preference:
matchExpressions:
- key: "node-role.kubernetes.io/control-plane"
operator: In
values: [""]
logging:
# Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano')
timeEncoding: epoch
# Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
level: info
# Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn)
# Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
develMode: true
resources:
limits:
cpu: 500m
memory: 350Mi
requests:
cpu: 200m
memory: 100Mi
mig:
strategy: single
driver:
# usually enabled by default, depends on deployment environment
enabled: false
nvidiaDriverCRD:
enabled: false
deployDefaultCR: true
driverType: gpu
nodeSelector: {}
useOpenKernelModules: false
# use pre-compiled packages for NVIDIA driver installation.
# only supported as a tech-preview feature on ubuntu22.04 kernels.
# there is no IronBank flavor for these containers
usePrecompiled: false
repository: nvcr.io/nvidia
image: driver
version: "550.90.07"
imagePullPolicy: IfNotPresent
imagePullSecrets: []
startupProbe:
initialDelaySeconds: 60
periodSeconds: 10
# nvidia-smi can take longer than 30s in some cases
# ensure enough timeout is set
timeoutSeconds: 60
failureThreshold: 120
rdma:
enabled: false
useHostMofed: false
upgradePolicy:
# global switch for automatic upgrade feature
# if set to false all other options are ignored
autoUpgrade: true
# how many nodes can be upgraded in parallel
# 0 means no limit, all nodes will be upgraded in parallel
maxParallelUpgrades: 1
# maximum number of nodes with the driver installed, that can be unavailable during
# the upgrade. Value can be an absolute number (ex: 5) or
# a percentage of total nodes at the start of upgrade (ex:
# 10%). Absolute number is calculated from percentage by rounding
# up. By default, a fixed value of 25% is used.'
maxUnavailable: 25%
# options for waiting on pod(job) completions
waitForCompletion:
timeoutSeconds: 0
podSelector: ""
# options for gpu pod deletion
gpuPodDeletion:
force: false
timeoutSeconds: 300
deleteEmptyDir: false
# options for node drain (`kubectl drain`) before the driver reload
# this is required only if default GPU pod deletions done by the operator
# are not sufficient to re-install the driver
drain:
enable: false
force: false
podSelector: ""
# It's recommended to set a timeout to avoid infinite drain in case non-fatal error keeps happening on retries
timeoutSeconds: 300
deleteEmptyDir: false
manager:
image: k8s-driver-manager
repository: nvcr.io/nvidia/cloud-native
# When choosing a different version of k8s-driver-manager, DO NOT downgrade to a version lower than v0.6.4
# to ensure k8s-driver-manager stays compatible with gpu-operator starting from v24.3.0
version: v0.6.9
imagePullPolicy: IfNotPresent
env:
- name: ENABLE_GPU_POD_EVICTION
value: "true"
- name: ENABLE_AUTO_DRAIN
value: "false"
- name: DRAIN_USE_FORCE
value: "false"
- name: DRAIN_POD_SELECTOR_LABEL
value: ""
- name: DRAIN_TIMEOUT_SECONDS
value: "0s"
- name: DRAIN_DELETE_EMPTYDIR_DATA
value: "false"
env: []
resources: {}
# Private mirror repository configuration
repoConfig:
configMapName: ""
# custom ssl key/certificate configuration
certConfig:
name: ""
# vGPU licensing configuration
licensingConfig:
configMapName: ""
nlsEnabled: true
# vGPU topology daemon configuration
virtualTopology:
config: ""
# kernel module configuration for NVIDIA driver
kernelModuleConfig:
name: ""
toolkit:
# usually enabled by default, depends on deployment environment
enabled: false
# there is no IronBank flavor for these containers
repository: nvcr.io/nvidia/k8s
image: container-toolkit
version: v1.16.0-rc.1-ubuntu20.04
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env:
# RKE2-specific configurations
- name: CONTAINERD_CONFIG
value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
- name: CONTAINERD_SOCKET
value: /run/k3s/containerd/containerd.sock
- name: CONTAINERD_RUNTIME_CLASS
value: nvidia
- name: CONTAINERD_SET_AS_DEFAULT
value: "true"
resources: {}
installDir: "/usr/local/nvidia"
devicePlugin:
enabled: true
repository: registry1.dso.mil/ironbank/opensource/nvidia
image: k8s-device-plugin
version: v0.15.1-ubi8
imagePullPolicy: IfNotPresent
imagePullSecrets: []
args: []
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY
value: envvar
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
# Default value of "all" causes the "display" capability to also be considered;
# however, not all hosts have or allow that capability, causing the daemonset to fail
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
resources: {}
# Plugin configuration
# Use "name" to either point to an existing ConfigMap or to create a new one with a list of configurations(i.e with create=true).
# Use "data" to build an integrated ConfigMap from a set of configurations as
# part of this helm chart. An example of setting "data" might be:
# config:
# name: device-plugin-config
# create: true
# data:
# default: |-
# version: v1
# flags:
# migStrategy: none
# mig-single: |-
# version: v1
# flags:
# migStrategy: single
# mig-mixed: |-
# version: v1
# flags:
# migStrategy: mixed
config:
# Create a ConfigMap (default: false)
create: false
# ConfigMap name (either existing or to create a new one with create=true above)
name: ""
# Default config name within the ConfigMap
default: ""
# Data section for the ConfigMap to create (i.e only applies when create=true)
data: {}
# MPS related configuration for the plugin
mps:
# MPS root path on the host
root: "/run/nvidia/mps"
# standalone dcgm host engine
dcgm:
# disabled by default to use embedded nv-host engine by exporter
enabled: false
repository: nvcr.io/nvidia/cloud-native
image: dcgm
version: 3.3.6-1-ubuntu22.04
imagePullPolicy: IfNotPresent
args: []
env: []
resources: {}
dcgmExporter:
# TODO: re-enable and integrate with Prometheus
# disabled due to Registry1 image issues
enabled: false
repository: registry1.dso.mil/ironbank/opensource/nvidia
image: dcgm-exporter
version: 3.3.6-3.4.2
imagePullPolicy: IfNotPresent
env:
- name: DCGM_EXPORTER_LISTEN
value: ":9400"
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
- name: DCGM_EXPORTER_COLLECTORS
value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
resources: {}
serviceMonitor:
enabled: false
interval: 15s
honorLabels: false
additionalLabels: {}
relabelings: []
# - source_labels:
# - __meta_kubernetes_pod_node_name
# regex: (.*)
# target_label: instance
# replacement: $1
# action: replace
gfd:
enabled: true
repository: registry1.dso.mil/ironbank/opensource/nvidia
image: k8s-device-plugin
version: v0.15.1-ubi8
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env:
- name: GFD_SLEEP_INTERVAL
value: 60s
- name: GFD_FAIL_ON_INIT_ERROR
value: "true"
resources: {}
migManager:
# usually enabled by default, depends on deployment environment
enabled: false
# there is no IronBank flavor for these containers
repository: nvcr.io/nvidia/cloud-native
image: k8s-mig-manager
version: v0.8.0-rc.1-ubuntu20.04
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env:
- name: WITH_REBOOT
value: "false"
resources: {}
config:
name: "default-mig-parted-config"
default: "all-disabled"
gpuClientsConfig:
name: ""
nodeStatusExporter:
enabled: false
repository: nvcr.io/nvidia/cloud-native
image: gpu-operator-validator
# If version is not specified, then default is to use chart.AppVersion
#version: ""
imagePullPolicy: IfNotPresent
imagePullSecrets: []
resources: {}
gds:
enabled: false
repository: nvcr.io/nvidia/cloud-native
image: nvidia-fs
version: "2.17.5"
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
args: []
gdrcopy:
enabled: false
repository: nvcr.io/nvidia/cloud-native
image: gdrdrv
version: "v2.4.1"
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
args: []
vgpuManager:
enabled: false
repository: ""
image: vgpu-manager
version: ""
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
resources: {}
driverManager:
image: k8s-driver-manager
repository: nvcr.io/nvidia/cloud-native
# When choosing a different version of k8s-driver-manager, DO NOT downgrade to a version lower than v0.6.4
# to ensure k8s-driver-manager stays compatible with gpu-operator starting from v24.3.0
version: v0.6.9
imagePullPolicy: IfNotPresent
env:
- name: ENABLE_GPU_POD_EVICTION
value: "false"
- name: ENABLE_AUTO_DRAIN
value: "false"
vgpuDeviceManager:
# usually enabled by default, depends on deployment environment
enabled: false
# there is no IronBank flavor for these containers
repository: nvcr.io/nvidia/cloud-native
image: vgpu-device-manager
version: "v0.2.6"
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
config:
name: ""
default: "default"
vfioManager:
# usually enabled by default, depends on deployment environment
enabled: false
repository: nvcr.io/nvidia
image: cuda
version: 12.5.0-base-ubi8
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
resources: {}
driverManager:
# there is no IronBank flavor for these containers
image: k8s-driver-manager
repository: nvcr.io/nvidia/cloud-native
# When choosing a different version of k8s-driver-manager, DO NOT downgrade to a version lower than v0.6.4
# to ensure k8s-driver-manager stays compatible with gpu-operator starting from v24.3.0
version: v0.6.9
imagePullPolicy: IfNotPresent
env:
- name: ENABLE_GPU_POD_EVICTION
value: "false"
- name: ENABLE_AUTO_DRAIN
value: "false"
kataManager:
enabled: false
config:
artifactsDir: "/opt/nvidia-gpu-operator/artifacts/runtimeclasses"
runtimeClasses:
- name: kata-nvidia-gpu
nodeSelector: {}
artifacts:
url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.54.03
pullSecret: ""
- name: kata-nvidia-gpu-snp
nodeSelector:
"nvidia.com/cc.capable": "true"
artifacts:
url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.86.10-snp
pullSecret: ""
repository: nvcr.io/nvidia/cloud-native
image: k8s-kata-manager
version: v0.2.0
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
resources: {}
sandboxDevicePlugin:
# usually enabled by default, depends on deployment environment
enabled: false
# there is no IronBank flavor for these containers
repository: nvcr.io/nvidia
image: kubevirt-gpu-device-plugin
version: v1.2.8
imagePullPolicy: IfNotPresent
imagePullSecrets: []
args: []
env: []
resources: {}
ccManager:
enabled: false
defaultMode: "off"
repository: nvcr.io/nvidia/cloud-native
image: k8s-cc-manager
version: v0.1.1
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env:
- name: CC_CAPABLE_DEVICE_IDS
value: "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d"
resources: {}
node-feature-discovery:
enableNodeFeatureApi: true
gc:
enable: true
replicaCount: 1
serviceAccount:
name: node-feature-discovery
create: false
worker:
serviceAccount:
name: node-feature-discovery
# disable creation to avoid duplicate serviceaccount creation by master spec below
create: false
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Equal"
value: ""
effect: "NoSchedule"
- key: "node-role.kubernetes.io/control-plane"
operator: "Equal"
value: ""
effect: "NoSchedule"
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
config:
sources:
pci:
deviceClassWhitelist:
- "02"
- "0200"
- "0207"
- "0300"
- "0302"
deviceLabelFields:
- vendor
master:
serviceAccount:
name: node-feature-discovery
create: true
config:
extraLabelNs: ["nvidia.com"]
# noPublish: false
# resourceLabels: ["nvidia.com/feature-1","nvidia.com/feature-2"]
# enableTaints: false
# labelWhiteList: "nvidia.com/gpu"
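With the values file above saved locally (e.g. as values.yaml), the install itself might look like the following sketch; the Helm repo, release name, and namespace are illustrative assumptions rather than our exact deployment scripts:

```bash
# Add the upstream NVIDIA Helm repo and install the GPU Operator with the
# Registry1-flavored values file. Release name, namespace, and version are
# illustrative; the chart version matches the operator version in the values.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v24.3.0 \
  -f values.yaml
```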
What is the current bug behavior?
The NVIDIA GPU Operator fails to start due to permission-denied
errors across ALL of its containers and pods.
For example, the NVIDIA GPU Operator validator has the entire host filesystem mounted at /host in its container.
The validator containers confirm that the NVIDIA container toolkit and NVIDIA drivers exist and are compatible with each other and with the GPU Operator's runtimes, run a test CUDA workload on the host, and collect metrics about the host's mounted GPU(s) for the GPU Operator to consume. All of these operations require direct read/write/execute (RWX) access to the host node's filesystem (mounted at /host).
We could try to modify the host filesystem or its groups to grant the nvidia user access to the relevant paths; however, the first chroot the validator attempts appears to target the root of the entire node's filesystem.
We would need to reconfirm exactly where and what the host-passthrough operations read, write, and execute before we could create the right users and groups for properly granular access.
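For reference, this is roughly how the failures were observed (a sketch; the namespace and initContainer name assume a default install of the upstream validator DaemonSet):

```bash
# List the GPU Operator pods; the operands sit in Init:Error / CrashLoopBackOff.
kubectl get pods -n gpu-operator

# Inspect the validator's failing initContainer. "driver-validation" is the
# first initContainer on the upstream nvidia-operator-validator DaemonSet.
kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c driver-validation
```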
What is the expected correct behavior?
All pods and containers reach a running/ready state and/or complete without errors.
This is achieved by running kubectl patches that set runAsUser: 0
in the initContainer securityContexts.
The kubectl patch commands used for the successful deployment are in the packages/nvidia-gpu-operator directory of the example repo linked above.
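A minimal sketch of the kind of patch involved, assuming the upstream DaemonSet name and initContainer layout (the actual commands live in the repo above):

```bash
# Force the first initContainer of the validator DaemonSet to run as root.
# RFC 6902 "add" replaces the member if it already exists, so this works
# whether or not a securityContext is already set on the initContainer.
kubectl patch daemonset nvidia-operator-validator -n gpu-operator --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/initContainers/0/securityContext", "value": {"runAsUser": 0}}]'
```

Note that the operator may reconcile this change away, so in practice the patch has to be reapplied after the operator redeploys its operands.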
Relevant logs and/or screenshots
Without the kubectl patch commands, there are multiple red-herring errors across 9 containers. All of them resolve as soon as the securityContext is patched as described in the previous section.
Possible fixes
Change the USER instruction in each of the NVIDIA GPU Operator components' Docker builds from nvidia back to root (or 0), or remove the downgrade entirely.
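As an illustration only (the exact Dockerfiles live in the Registry1 repos, so the stage layout and scripted edit below are assumptions), the change amounts to:

```bash
# In each component's Dockerfile, the final stage currently downgrades the
# user, e.g.:
#   USER nvidia
# Reverting it to match upstream means setting:
#   USER root
# which could be scripted across the repos as:
sed -i 's/^USER nvidia$/USER root/' Dockerfile
```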
Tasks
- [ ] Revert all NVIDIA GPU Operator containers back to the root user

Please read the Iron Bank Documentation for more info.