Noah Birrer requested to merge fix/failed-release-pod-logs into master Jun 18, 2024

General MR

Adjusts install.remediation.retries for HelmReleases so that flux does not perform a helm uninstall of releases that are in a failed state. That uninstall prevents get_log_dump from capturing logs from failed pods during pipeline runs where DEBUG is enabled (#371).

Summary

Changes the default settings for how flux handles failed installs of HelmReleases by adjusting the number of install retries from -1 (infinite) to 3. Once the retries are exhausted, flux stops uninstalling the failed HelmRelease, so the pods that belong to it are not deleted. Pod logs have intermittently been missing from DEBUG artifacts because flux was still performing retries against the failed release when get_log_dump ran.

A couple of important callouts:

A retry is helm uninstall -> helm install against the failed release. If the release still fails after the last retry, the release is marked as failed in helm and as Stalled with reason RetriesExceeded in flux.
I am only changing the values in test-values.yaml in order to preserve the original intention behind switching to infinite retries.
This MR removes install.timeout from test-values.yaml, causing install.timeout to default to the 10m in BigBang values. I am open to discussion here, but the current timeout of 60m on install prevents flux from performing any retries. Bumping down to 10m ensures at least 3 retries are performed within the 3600s loop in wait_all_hr.
This change is being made in tandem with this MR to update the wait_all_hr function to exit when RetriesExceeded or InstallFailed reasons are encountered for HelmReleases, shortening the dev loop.
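
For reference, the settings discussed in the callouts above correspond roughly to the following HelmRelease fields. This is a minimal sketch of a spec fragment using the Flux HelmRelease field names, not a copy of test-values.yaml; how BigBang's values files map onto these fields is not shown in this description and is assumed here.

```yaml
# Illustrative HelmRelease spec fragment (not a complete manifest).
spec:
  install:
    # No install-level timeout is set, so the default applies
    # (10m in BigBang values, per the callout above).
    remediation:
      # Previously -1 (infinite retries). Each retry is a
      # helm uninstall followed by a helm install.
      retries: 3
```

As a rough check under these settings, one initial install plus three retries that each run to the 10m timeout comes to about 40m, which fits inside the 3600s loop in wait_all_hr. That estimate ignores uninstall time and reconcile intervals.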

I'm of the opinion that if a HelmRelease is failing to deploy in the BigBang pipeline, then that should cause the pipeline to fail. As configured today, with retries: -1, the pipeline never fails on its own when a HelmRelease is failing to deploy (see this MR for more context). Instead, we wait for the wait_all_hr timeout to be reached, which takes an hour. Implementing this change will shorten the dev loop for BigBang pipelines by retrying failed releases only 3 times. Pipelines that currently take 60m to fail should now fail ~15m sooner.

Relevant logs/screenshots

Link to a pipeline run where the vault HelmRelease is in a failed state and get_log_dump does not capture any pod logs due to the failed HelmRelease being uninstalled. More context is available in this thread.

Link to a pipeline run where the suggested change causes the pipeline to fail once flux has exhausted all retries against the vault HelmRelease. Note that the pipeline's elapsed time is 48m instead of 65m and that the failing pods are preserved.

Linked Issue

big-bang/pipeline-templates/pipeline-templates#371 (closed)

Depends on: big-bang/pipeline-templates/pipeline-templates!604 (closed)

Upgrade Notices

N/A

Edited Jun 20, 2024 by Noah Birrer
