Flux helm controller crash causes HR state issues
Symptoms of the issue are HRs showing this error: `Helm upgrade failed: another operation (install/upgrade/rollback) is in progress`. Sometimes listing the HR will only show `max retries exhausted`, but a describe of the HR will reveal the error above as the root cause.
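A quick way to surface the underlying error (a sketch; `kiali` is only an example HR name):

```shell
# list HR status across the bigbang namespace
flux get hr -n bigbang
# inspect a single HR; the status conditions and events show the underlying Helm error
kubectl describe hr kiali -n bigbang
```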
In our experience this is consistently tied to the Flux helm controller crashing.
- Originally some of these crashes were OOM kills; we bumped up the helm controller's limits (see the sketch after this list) and have still seen issues.
- Currently the crashes/issues seem related to the k8s API being overwhelmed by large numbers of new resources/changes.
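The commands below are a sketch of how to check for controller restarts and, if OOM kills are the culprit, raise the memory limit in place. The namespace, labels, and `2Gi` value assume a default Flux install in `flux-system`; if Flux manages its own manifests from Git, make the equivalent change in your bootstrap kustomization instead so it is not reverted.

```shell
# check whether the helm controller has been restarting (e.g. OOMKilled)
kubectl get pods -n flux-system -l app=helm-controller
kubectl describe pod -n flux-system -l app=helm-controller
# logs from the previous (crashed) container instance
kubectl logs -n flux-system -l app=helm-controller --previous

# example only: raise the memory limit on the controller in place
kubectl patch deployment helm-controller -n flux-system --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"}]'
```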
The crash leaves the Helm release secret stuck in a pending (in-progress) state, which Flux is unable to reconcile its way out of.
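One way to confirm this (a sketch, using the `bigbang` namespace from the examples below) is to ask Helm which releases are stuck pending:

```shell
# shows releases stuck in pending-install/pending-upgrade/pending-rollback
helm list -n bigbang --pending
```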
One workaround is to delete the offending release secret, as shown in this example:
```shell
# example w/ kiali
HR_NAME=kiali
kubectl get secrets -n bigbang | grep ${HR_NAME}
# example output:
# (some HR names are duplicated w/ a dash in the middle, some are not)
#   sh.helm.release.v1.kiali-kiali.v1   helm.sh/release.v1   1   18h
#   sh.helm.release.v1.kiali-kiali.v2   helm.sh/release.v1   1   17h
#   sh.helm.release.v1.kiali-kiali.v3   helm.sh/release.v1   1   17m
# delete the most recent one:
kubectl delete secret -n bigbang sh.helm.release.v1.${HR_NAME}-${HR_NAME}.v3
# suspend/resume the HR
flux suspend hr -n bigbang ${HR_NAME}
flux resume hr -n bigbang ${HR_NAME}
```
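After the resume, the HR should reconcile cleanly; a quick check (sketch):

```shell
# confirm the HR returns to a Ready state
flux get hr -n bigbang ${HR_NAME}
```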
Another workaround that should work (not tested by the Big Bang team):
```shell
HR_NAME=kiali
# run a helm history command to find the latest release before the issue (its status should show "deployed")
helm history ${HR_NAME} -n bigbang
# use that revision number in this command
helm rollback ${HR_NAME} <revision> -n bigbang
flux reconcile hr bigbang -n bigbang
```
This is being tracked as an upstream Flux issue here. The maintainers recently noted a rework of the helm controller reconciliation code that may help with this issue (ref).