Duplicate imagePullSecrets prevents redis from rescheduling after node failure
Problem Description
On Kubernetes 1.26.x (confirmed on 1.26.3 and 1.26.4), when a node fails, redis cannot be rescheduled and remains scheduled on the non-existent node. Its pod spec contains duplicate entries for imagePullSecrets, and pod garbage collection fails on them (likely a bug).
Steps to reproduce:
- Start a cluster on 1.26.3 or 1.26.4
- Deploy GitLab using the default values for the pull secrets
- Once fully deployed, terminate the node hosting redis in the cloud provider to simulate a node failure
- Wait and observe that every pod except redis gets rescheduled
Additional Info
Relevant logs from kube-controller-manager:
I0508 15:11:02.385261 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gitlab/gitlab-redis-master-0"
E0508 15:11:02.389990 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gitlab/gitlab-redis-master-0; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="private-registry"]
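The error above reports the same `name` key appearing more than once under `.spec.imagePullSecrets`. A minimal sketch of how to check a pod manifest for this, assuming the manifest has already been loaded into a dict (for example from JSON output); the function name and the sample manifest are hypothetical, and only the secret name `private-registry` comes from the log:

```python
from collections import Counter


def duplicate_pull_secrets(pod: dict) -> list[str]:
    """Return the imagePullSecrets names that occur more than once in a Pod manifest."""
    secrets = pod.get("spec", {}).get("imagePullSecrets") or []
    counts = Counter(entry.get("name") for entry in secrets)
    return [name for name, seen in counts.items() if seen > 1]


# Hypothetical manifest fragment, shaped like the pod named in the log above.
pod = {
    "metadata": {"namespace": "gitlab", "name": "gitlab-redis-master-0"},
    "spec": {
        "imagePullSecrets": [
            {"name": "private-registry"},
            {"name": "private-registry"},
        ]
    },
}

print(duplicate_pull_secrets(pod))
```

An empty result means the list has no repeated names; any name returned here is one that the log line above would complain about.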
imagePullSecrets from pod spec:
Digging through the redis chart included in this repo shows that the pullSecrets for the different components in the chart are simply concatenated together and given to every component.
Relevant section from templates/_helpers.tpl:
{{- define "redis.imagePullSecrets" -}}
{{/*
Helm 2.11 supports the assignment of a value to a variable defined in a different scope,
but Helm 2.9 and 2.10 does not support it, so we need to implement this if-else logic.
Also, we can not use a single if because lazy evaluation is not an option
*/}}
{{- if .Values.global }}
{{- if .Values.global.imagePullSecrets }}
imagePullSecrets:
{{- range .Values.global.imagePullSecrets }}
- name: {{ . }}
{{- end }}
{{- else if or .Values.image.pullSecrets .Values.metrics.image.pullSecrets .Values.sysctlImage.pullSecrets .Values.volumePermissions.image.pullSecrets }}
imagePullSecrets:
{{- range .Values.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.metrics.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.sysctlImage.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.volumePermissions.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- end -}}
{{- else if or .Values.image.pullSecrets .Values.metrics.image.pullSecrets .Values.sysctlImage.pullSecrets .Values.volumePermissions.image.pullSecrets }}
imagePullSecrets:
{{- range .Values.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.metrics.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.sysctlImage.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.volumePermissions.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- end -}}
{{- end -}}
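To make the effect of the template easier to see, here is a rough Python model of its non-global branch. It is only a sketch: the function names and the sample values are hypothetical, and it assumes the same secret name is configured for more than one component. The final deduplication step is an illustration of one possible remedy, not something the chart does today.

```python
# Value paths that the non-global branch of the template ranges over, in order.
PULL_SECRET_PATHS = [
    ("image", "pullSecrets"),
    ("metrics", "image", "pullSecrets"),
    ("sysctlImage", "pullSecrets"),
    ("volumePermissions", "image", "pullSecrets"),
]


def lookup(values: dict, path: tuple) -> list:
    """Walk a nested dict along path, returning [] if any level is missing."""
    node = values
    for key in path:
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    return node or []


def render_pull_secrets(values: dict) -> list[dict]:
    """Mimic the template: append every component's pullSecrets with no deduplication."""
    rendered = []
    for path in PULL_SECRET_PATHS:
        rendered.extend({"name": name} for name in lookup(values, path))
    return rendered


def dedupe(rendered: list[dict]) -> list[dict]:
    """Drop repeated names while keeping first-seen order (illustrative remedy only)."""
    return [{"name": name} for name in dict.fromkeys(entry["name"] for entry in rendered)]


# Hypothetical values: the same secret set for the main image and the metrics image.
values = {
    "image": {"pullSecrets": ["private-registry"]},
    "metrics": {"image": {"pullSecrets": ["private-registry"]}},
}

rendered = render_pull_secrets(values)
print(rendered)          # two identical entries, as in the failing pod spec
print(dedupe(rendered))  # a single entry
```

With these values the modelled output lists `private-registry` twice, which matches the duplicate key reported by kube-controller-manager.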
Recommended solution:
Consolidate pullSecrets from .Values.redis.metrics.pullSecrets and .Values.redis.metrics.pullSecrets into .Values.redis.global.pullSecrets
Or
Remove one or the other of the two sources of pullSecrets listed above