fix: adjust flux values to perform 3 retries on failed install
General MR
Adjusts install.remediation.retries for HelmReleases so that flux does not perform a helm uninstall of releases that are in a failed state. That uninstall currently prevents get_log_dump from capturing logs from failed pods during pipeline runs where DEBUG is enabled (#371).
Summary
Changes the default settings for how flux handles failed installs of HelmReleases. Adjusts the number of retries on install from -1 (infinite) to 3. This ensures that pods that belong to the failed HelmRelease are not deleted. Pod logs have intermittently been excluded from DEBUG artifacts because flux was still performing retries against the failed release when get_log_dump ran.
A couple of important callouts:
- A retry is a helm uninstall -> helm install against the failed release. If the release still fails after the last retry, the release is marked as failed in helm and as Stalled and RetriesExceeded in flux.
- I am only changing the values in test-values.yaml in order to preserve the original intention behind switching to infinite retries.
- This MR removes install.timeout from test-values.yaml, causing install.timeout to default to the 10m in BigBang values. I am open to discussion here, but the current timeout of 60m on install prevents flux from performing any retries. Bumping down to 10m ensures at least 3 retries are performed within the 3600s loop in wait_all_hr.
- This change is being made in tandem with this MR to update the wait_all_hr function to exit when RetriesExceeded or InstallFailed reasons are encountered for HelmReleases, shortening the dev loop.
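The timeout callout above can be checked with a rough worst-case calculation. The sketch below is only illustrative: the function name is hypothetical, it assumes every install attempt hangs until the full install timeout, and it ignores the time spent on helm uninstall and any delay flux adds between attempts.

```python
def completed_retries(install_timeout_s: int, window_s: int, retries: int) -> int:
    """Worst-case number of retries that can run to their full timeout
    inside the wait window.

    Simplifying assumptions (not taken from flux itself): every attempt
    hangs until the install timeout, and uninstall time and delays
    between attempts are ignored.
    """
    # Attempts that fit in the window: the initial install plus any retries.
    attempts_that_fit = window_s // install_timeout_s
    retries_that_fit = max(0, attempts_that_fit - 1)
    # A negative value (e.g. -1) means unlimited retries, so only the
    # window limits how many can happen.
    if retries < 0:
        return retries_that_fit
    return min(retries, retries_that_fit)


WAIT_WINDOW_S = 3600  # length of the wait_all_hr loop as stated in this MR

print(completed_retries(60 * 60, WAIT_WINDOW_S, 3))  # 60m install timeout
print(completed_retries(10 * 60, WAIT_WINDOW_S, 3))  # 10m install timeout
```

Under these assumptions a 60m install timeout leaves room for 0 retries, because the first attempt alone consumes the whole window, while a 10m timeout leaves room for all 3.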
I'm of the opinion that if a HelmRelease is failing to deploy in the BigBang pipeline, then that should cause the pipeline to fail. As configured today, with retries: -1, the pipeline never fails because of a HelmRelease that is failing to deploy (see this MR for more context). Instead, we wait for the wait_all_hr timeout to be reached, which takes an hour. Implementing this change will shorten the dev loop for BigBang pipelines by only retrying failed releases 3 times. Pipelines that currently take 60m to fail should now fail ~15m sooner.
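To illustrate the early-exit idea, here is a minimal sketch of the check described in the callouts. It is not the wait_all_hr implementation, which lives in pipeline-templates. The function name and the example object are hypothetical, and the sketch assumes the HelmRelease has been fetched as JSON (for example with kubectl) and parsed into a dict with the usual status.conditions list.

```python
# Reasons this MR names as grounds for exiting the wait loop early.
TERMINAL_REASONS = {"RetriesExceeded", "InstallFailed"}


def has_terminal_failure(helm_release: dict) -> bool:
    """Return True if any status condition carries a terminal failure reason."""
    status = helm_release.get("status") or {}
    conditions = status.get("conditions") or []
    return any(c.get("reason") in TERMINAL_REASONS for c in conditions)


# Hypothetical, trimmed-down HelmRelease object for illustration only.
example = {
    "metadata": {"name": "vault", "namespace": "bigbang"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False", "reason": "InstallFailed"},
            {"type": "Stalled", "status": "True", "reason": "RetriesExceeded"},
        ]
    },
}

if has_terminal_failure(example):
    print("HelmRelease failed terminally; stop waiting and fail the pipeline")
```

A wait loop built on a check like this can fail as soon as flux gives up on a release, rather than running until its own timeout.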
Relevant logs/screenshots
Link to a pipeline run where the vault HelmRelease is in a failed state and get_log_dump does not capture any pod logs due to the failed HelmRelease being uninstalled. More context is available in this thread.
Link to a pipeline run where the suggested change causes the pipeline to fail when flux has exhausted all retries against the vault HelmRelease. Note that the pipeline's elapsed time is 48m instead of 65m and that the failing pods are preserved.
Linked Issue
big-bang/pipeline-templates/pipeline-templates#371 (closed)
Depends on: big-bang/pipeline-templates/pipeline-templates!604 (closed)
Upgrade Notices
N/A