Flux helm controller crash causes HR state issues
Symptoms of the issue are HRs showing this error: `Helm upgrade failed: another operation (install/upgrade/rollback) is in progress`. Sometimes listing the HR will only show `max retries exhausted`, but a describe of the HR will reveal the error above as the root cause.
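A quick way to surface the underlying error (a sketch; `kiali` is only an example HR name):

```shell
# list HR status across the bigbang namespace
flux get hr -n bigbang
# inspect a single HR; the status conditions and events show the underlying Helm error
kubectl describe hr kiali -n bigbang
```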
In our experience this is consistently tied to the Flux helm controller crashing.
- Originally some of these crashes were OOM kills; we bumped up the helm controller's limits (see the sketch after this list) and have still seen issues.
- Currently the crashes/issues seem related to the k8s API being overwhelmed by large numbers of new resources/changes.
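The commands below are a sketch of how to check for controller restarts and, if OOM kills are the culprit, raise the memory limit in place. The namespace, labels, and `2Gi` value assume a default Flux install in `flux-system`; if Flux manages its own manifests from Git, make the equivalent change in your bootstrap kustomization instead so it is not reverted.

```shell
# check whether the helm controller has been restarting (e.g. OOMKilled)
kubectl get pods -n flux-system -l app=helm-controller
kubectl describe pod -n flux-system -l app=helm-controller
# logs from the previous (crashed) container instance
kubectl logs -n flux-system -l app=helm-controller --previous

# example only: raise the memory limit on the controller in place
kubectl patch deployment helm-controller -n flux-system --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi"}]'
```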
The crash leaves the Helm release secret stuck in a pending (in-progress) state, which Flux is unable to reconcile its way out of.
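One way to confirm this (a sketch, using the `bigbang` namespace from the examples below) is to ask Helm which releases are stuck pending:

```shell
# shows releases stuck in pending-install/pending-upgrade/pending-rollback
helm list -n bigbang --pending
```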
One workaround is to delete the offending release secret, as shown in this example:
```shell
# example w/ kiali
HR_NAME=kiali
kubectl get secrets -n bigbang | grep ${HR_NAME}
# example output:
# (some HR names are duplicated w/ a dash in the middle, some are not)
#   sh.helm.release.v1.kiali-kiali.v1   helm.sh/release.v1   1   18h
#   sh.helm.release.v1.kiali-kiali.v2   helm.sh/release.v1   1   17h
#   sh.helm.release.v1.kiali-kiali.v3   helm.sh/release.v1   1   17m
# delete the most recent one:
kubectl delete secret -n bigbang sh.helm.release.v1.${HR_NAME}-${HR_NAME}.v3
# suspend/resume the HR
flux suspend hr -n bigbang ${HR_NAME}
flux resume hr -n bigbang ${HR_NAME}
```
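After the resume, the HR should reconcile cleanly; a quick check (sketch):

```shell
# confirm the HR returns to a Ready state
flux get hr -n bigbang ${HR_NAME}
```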
Another workaround that should work (not tested by the Big Bang team):
```shell
HR_NAME=kiali
# run a helm history command to find the latest release before the issue (its status should show "deployed")
helm history ${HR_NAME} -n bigbang
# use that revision number in this command
helm rollback ${HR_NAME} <revision> -n bigbang
flux reconcile hr bigbang -n bigbang
```
This is being tracked as an upstream Flux issue here. The maintainers recently noted a rework of the helm controller reconciliation code that may help with this issue (ref).