Duplicate imagePullSecrets prevents Grafana from rescheduling after node failure
Problem Description
On Kubernetes 1.26.x (confirmed on 1.26.3 and 1.26.4), Grafana is not rescheduled after a node failure: its pod stays bound to the no-longer-existing node because the pod spec contains duplicate imagePullSecrets entries, which break force deletion in the PodGC controller (likely an upstream garbage-collection bug).
Steps to reproduce:
- Start a cluster on 1.26.3 or 1.26.4
- Deploy Monitoring using the default values for the pull secrets
- Once fully deployed, terminate the node hosting Grafana in the cloud provider to simulate a node failure
- Observe that every pod except Grafana gets rescheduled
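The stuck state can be confirmed by checking the pod spec for repeated pull-secret names. The sketch below runs against a saved spec (`pod.yaml` is a stand-in reproducing the duplication from the error message; on a live cluster you would first export it with `kubectl get pod <grafana-pod> -n monitoring -o yaml > pod.yaml`):

```shell
# Stand-in pod spec with the duplication reported in the logs below.
cat > pod.yaml <<'EOF'
spec:
  imagePullSecrets:
  - name: private-registry
  - name: private-registry
EOF

# Print any secret name that appears more than once; such a duplicate is
# what trips the structured-merge-diff conversion during PodGC force deletion.
awk '/- name:/{print $NF}' pod.yaml | sort | uniq -d
# prints: private-registry
```

An empty result from the last pipeline would mean the spec is clean and the bug described here does not apply.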
Additional Info
Relevant logs from kube-controller-manager:

```
I0517 19:05:06.656079 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="monitoring/monitoring-monitoring-grafana-754f6766cf-2ht2r"
E0517 19:05:06.666220 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (monitoring/monitoring-monitoring-grafana-754f6766cf-2ht2r; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="private-registry"]
```
imagePullSecrets from pod spec:
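Reconstructed from the error message above (the original manifest was not captured here), the relevant section of the pod spec listed the same secret twice:

```yaml
spec:
  imagePullSecrets:
  - name: private-registry
  - name: private-registry
```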
Similar issue in Gitlab's repo: gitlab#186 (closed)
Recommended solution: have kube-state-metrics and grafana use .Values.global.imagePullSecrets, which are already defined in the values.
We can remove:
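For context, relying on the global values would look roughly like this (a sketch assuming the common Helm global-values convention; this chart's exact schema may differ):

```yaml
# values.yaml (illustrative; assumes the standard Helm `global` convention)
global:
  imagePullSecrets:
  - name: private-registry
```

With the secret defined once under `global`, the per-chart pull-secret entries become redundant, and dropping them avoids the duplicate entries that break PodGC.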