fix: adjust flux values to perform 3 retries on failed install
General MR
Adjusts install.remediation.retries for HelmReleases so that flux does not perform a helm uninstall of releases that are in a failed state. The uninstall currently prevents get_log_dump from capturing logs from failed pods during pipeline runs where DEBUG is enabled (#371).
Summary
Changes the default settings for how flux handles failed installs of HelmReleases by reducing the number of install retries from -1 (infinite) to 3. This ensures that pods belonging to the failed HelmRelease are not deleted. Pod logs have intermittently been missing from DEBUG artifacts because flux was still performing retries against the failed release when get_log_dump ran.
A couple of important callouts:
- A retry is helm uninstall -> helm install against the failed release. If the release still fails after the last retry, the release is marked as failed in helm and Stalled and RetriesExceeded in flux.
- I am only changing the values in test-values.yaml in order to preserve the original intention behind switching to infinite retries.
- This MR removes install.timeout from test-values.yaml, causing install.timeout to default to the 10m in BigBang values. I am open to discussion here, but the current timeout of 60m on install prevents flux from performing any retries. Bumping down to 10m ensures at least 3 retries are performed within the 3600s loop in wait_all_hr.
- This change is being made in tandem with this MR to update the wait_all_hr function to exit when RetriesExceeded or InstallFailed reasons are encountered for HelmReleases, shortening the dev loop.
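The early-exit check described in the last bullet could look roughly like the sketch below. It is written in Python purely for illustration; the real wait_all_hr lives in pipeline-templates and is not shown in this MR. The function name failed_releases and the sample data are hypothetical, and the input is assumed to have the shape of `kubectl get helmreleases -A -o json` output, with reasons found under status.conditions.

```python
# Hypothetical sketch: not the actual wait_all_hr implementation.
# Reasons named in this MR as grounds for exiting the wait loop early.
TERMINAL_REASONS = {"RetriesExceeded", "InstallFailed"}


def failed_releases(hr_list):
    """Return sorted 'namespace/name' strings for HelmReleases that carry
    at least one condition whose reason is in TERMINAL_REASONS."""
    failed = []
    for item in hr_list.get("items", []):
        meta = item.get("metadata", {})
        conditions = item.get("status", {}).get("conditions", [])
        if any(c.get("reason") in TERMINAL_REASONS for c in conditions):
            failed.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return sorted(failed)


# Made-up sample data for illustration only.
sample = {
    "items": [
        {
            "metadata": {"namespace": "bigbang", "name": "vault"},
            "status": {"conditions": [
                {"type": "Ready", "status": "False", "reason": "InstallFailed"},
                {"type": "Stalled", "status": "True", "reason": "RetriesExceeded"},
            ]},
        },
        {
            "metadata": {"namespace": "bigbang", "name": "istio"},
            "status": {"conditions": [
                {"type": "Ready", "status": "True", "reason": "ReconciliationSucceeded"},
            ]},
        },
    ]
}

print(failed_releases(sample))  # ['bigbang/vault']
```

A wait loop could call a check like this on every poll and exit non-zero as soon as the returned list is non-empty, instead of waiting for the full timeout.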
I'm of the opinion that if a HelmRelease is failing to deploy in the BigBang pipeline, then that should cause the pipeline to fail. As configured today, with retries: -1, the pipeline will never fail when a HelmRelease is failing to deploy (see this MR for more context). Rather, we wait for the wait_all_hr timeout to be reached, which takes an hour. Implementing this change will shorten the dev loop for BigBang pipelines by only retrying failed releases 3 times. Pipelines that take 60m to fail will now take ~15m less.
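The timing argument can be checked with rough arithmetic. The sketch below assumes the worst case in which every install attempt runs for the full install.timeout, and it ignores the time spent on helm uninstall and any delay flux adds between attempts, so the numbers are upper bounds rather than measurements. The function names are made up for illustration.

```python
def attempts_within(window_s, timeout_s):
    """Number of full-length install attempts that fit in the window."""
    return window_s // timeout_s


def worst_case_seconds(retries, timeout_s):
    """One initial install plus `retries` retries, each hitting the timeout."""
    return (1 + retries) * timeout_s


WINDOW_S = 3600  # length of the wait_all_hr loop, per the description above

# 60m timeout: the initial install alone can consume the whole window,
# leaving room for 0 retries.
print(attempts_within(WINDOW_S, 60 * 60) - 1)  # 0

# 10m timeout: 6 attempts fit, i.e. the initial install plus up to 5 retries,
# so the configured 3 retries fit comfortably.
print(attempts_within(WINDOW_S, 10 * 60) - 1)  # 5

# retries: 3 with a 10m timeout gives up after at most 2400s (40m).
print(worst_case_seconds(3, 10 * 60))  # 2400
```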
Relevant logs/screenshots
Link to a pipeline run where the vault HelmRelease is in a failed state and get_log_dump does not capture any pod logs due to the failed HelmRelease being uninstalled. More context is available in this thread.
Link to a pipeline run where the suggested change causes the pipeline to fail when flux has exhausted all retries against the vault HelmRelease. Note that the pipeline's elapsed time is 48m instead of 65m and that the failing pods are preserved.
Linked Issue
big-bang/pipeline-templates/pipeline-templates#371 (closed)
Depends on: big-bang/pipeline-templates/pipeline-templates!604 (closed)
Upgrade Notices
N/A