UNCLASSIFIED - NO CUI

SPIKE: create an epic for applying a PriorityClass to critical components

Summary

While updating a Big Bang deployment, we noticed that the monitoring release repeatedly failed to reconcile, each time with a timeout error.

The underlying cause was that the prometheus-node-exporter pods could not be scheduled on every node because of insufficient resources. The status of the failed Helm release showed this:

      DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 0 out of 10 expected pods have been scheduled
      DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 1 out of 10 expected pods have been scheduled
      DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 2 out of 10 expected pods have been scheduled
      warning: Upgrade "monitoring-monitoring" failed: timed out waiting for the condition
    reason: UpgradeFailed

The pod events showed the same problem:

      Warning  FailedScheduling   39s (x6 over 2m18s)  default-scheduler   0/10 nodes are available: 1 Insufficient cpu, 9 node(s) didn't match Pod's node affinity/selector

Each DaemonSet pod is pinned to a single target node through node affinity, so the 9 non-matching nodes are expected. The relevant part is that the one node the pod was meant for did not have enough free CPU.

Temporary resolution

The cluster we were deploying into happened to have a pre-defined PriorityClass. We manually added it to the DaemonSet's pod spec (as priorityClassName), after which the pods were scheduled and the release reconciled as expected.
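
As a rough illustration of that manual change, the fragment below shows where the field sits in the DaemonSet. The DaemonSet name and namespace come from the Helm error above. The PriorityClass name `existing-priority-class` is a placeholder, since the actual class name used in that cluster is not recorded here.

```yaml
# Sketch only: the relevant portion of the DaemonSet, not a complete manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-monitoring-prometheus-node-exporter
  namespace: monitoring
spec:
  template:
    spec:
      # Placeholder name; must match a PriorityClass that already exists in the cluster.
      priorityClassName: existing-priority-class
```

Because the DaemonSet is managed by the Helm release, a manual edit like this is likely to be overwritten on the next upgrade unless the same setting is carried in the chart values.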

Notional feature request

Big Bang deploys several other DaemonSets that I would expect to hit the same scenario: Promtail, Twistlock defenders, and Velero/restic all come to mind. Big Bang could define a PriorityClass (perhaps as part of /base) and add it to DaemonSets and other critical components as appropriate, so that reconciliation of Big Bang managed Helm releases can complete without hanging on resource scheduling constraints.
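
A minimal sketch of what such a PriorityClass might look like is below. The name `bigbang-critical`, the value, and the description are assumptions chosen for illustration, not an existing Big Bang resource. The right value would need to be decided as part of the epic.

```yaml
# Hypothetical PriorityClass; name and value are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: bigbang-critical
# Higher values are scheduled first; user-defined classes must stay at or below 1000000000.
value: 1000000
# Do not apply this priority to pods that omit priorityClassName.
globalDefault: false
# Default behavior, stated explicitly: the scheduler may evict lower-priority pods to make room.
preemptionPolicy: PreemptLowerPriority
description: "Priority for Big Bang DaemonSets and other critical components."
```

Components would opt in by setting `priorityClassName: bigbang-critical` in their pod specs. Since preemption can evict lower-priority workloads on a full node, which components count as critical is worth deciding deliberately rather than applying the class everywhere.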