UNCLASSIFIED - NO CUI


Merged Noah Birrer requested to merge fix/failed-release-pod-logs into master 10 months ago

General MR

Adjusts install.remediation.retries for HelmReleases so that flux does not perform a helm uninstall of releases that are in a failed state. That uninstall currently prevents get_log_dump from capturing logs from failed pods during pipeline runs where DEBUG is enabled (#371).

Summary

Changes the default settings for how flux handles failed installs of HelmReleases by adjusting the number of install retries from -1 (infinite) to 3. This ensures that pods belonging to the failed HelmRelease are not deleted. Pod logs have intermittently been missing from DEBUG artifacts because flux was still performing retries against the failed release when get_log_dump ran.

A couple of important callouts:

A retry is helm uninstall -> helm install against the failed release. If the release still fails after the last retry, the release is marked as failed in helm and Stalled and RetriesExceeded in flux.
I am only changing the values in test-values.yaml, in order to preserve the original intention behind switching to infinite retries everywhere else.
This MR removes install.timeout from test-values.yaml, so install.timeout falls back to the 10m default in BigBang values. I am open to discussion here, but the current install timeout of 60m prevents flux from performing any retries, because a single attempt can consume the entire 3600s loop in wait_all_hr. Bumping it down to 10m ensures at least 3 retries are performed within that loop.
This change is being made in tandem with this MR to update the wait_all_hr function to exit when RetriesExceeded or InstallFailed reasons are encountered for HelmReleases, shortening the dev loop.
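
The early-exit check described in the last callout can be sketched as follows. This is only an illustration in Python, not the actual wait_all_hr code: the helper name `terminal_failure_reason` is hypothetical, and the input is assumed to be a HelmRelease object parsed from JSON with the usual `status.conditions` list, where each condition carries a `reason` field.

```python
from typing import Optional

# Reasons this MR treats as terminal (taken from the description above).
TERMINAL_REASONS = {"RetriesExceeded", "InstallFailed"}


def terminal_failure_reason(helmrelease: dict) -> Optional[str]:
    """Return the first terminal failure reason in status.conditions, or None.

    Hypothetical helper: assumes `helmrelease` is a HelmRelease parsed from
    JSON and that each condition is a dict with a "reason" key.
    """
    conditions = helmrelease.get("status", {}).get("conditions", [])
    for condition in conditions:
        reason = condition.get("reason")
        if reason in TERMINAL_REASONS:
            return reason
    return None


# Made-up example objects, shaped like a failed and a still-reconciling release.
failed_hr = {
    "metadata": {"name": "vault"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False", "reason": "InstallFailed"},
            {"type": "Stalled", "status": "True", "reason": "RetriesExceeded"},
        ]
    },
}
progressing_hr = {
    "metadata": {"name": "vault"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "Unknown", "reason": "Progressing"},
        ]
    },
}
```

A wait loop built on this would stop polling as soon as any HelmRelease returns a non-None reason, rather than waiting for the full timeout.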

I'm of the opinion that if a HelmRelease is failing to deploy in the BigBang pipeline, the pipeline should fail. As configured today, with retries: -1, the pipeline never fails because of the failing HelmRelease itself (see this MR for more context). Instead, we wait for the wait_all_hr timeout to be reached, which takes an hour. Implementing this change will shorten the dev loop for BigBang pipelines by retrying failed releases only 3 times. Pipelines that currently take 60m to fail will take roughly 15m less.
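
The timing argument can be sanity-checked with some rough arithmetic. The sketch below is an approximation under stated assumptions: both function names are hypothetical, every attempt is assumed to run for the full install timeout, and a failed install is assumed to cost one initial attempt plus the configured retries. It ignores uninstall time, the reconcile interval, and any time spent before the failing release starts installing, so real pipelines will not match these numbers exactly.

```python
def worst_case_install_minutes(timeout_minutes: int, retries: int) -> int:
    """Upper bound on install time: one initial attempt plus `retries` more,
    each assumed to run for the full timeout."""
    return timeout_minutes * (1 + retries)


def attempts_within_window(timeout_minutes: int, window_minutes: int) -> int:
    """How many full-length attempts fit inside the wait window."""
    return window_minutes // timeout_minutes


WAIT_WINDOW_MINUTES = 3600 // 60  # the 3600s loop in wait_all_hr

# 60m timeout: a single attempt fills the window, leaving room for no retries.
print(attempts_within_window(60, WAIT_WINDOW_MINUTES))  # 1

# 10m timeout with 3 retries: 4 attempts, all inside the 60m window.
print(worst_case_install_minutes(10, 3))  # 40
```

Under these assumptions, 3 retries at a 10m timeout finish well inside the 60m window, which is consistent with the shorter failing pipeline reported below.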

Relevant logs/screenshots

Link to a pipeline run where the vault HelmRelease is in a failed state and get_log_dump does not capture any pod logs due to the failed HelmRelease being uninstalled. More context is available in this thread.

Link to a pipeline run where the suggested change causes the pipeline to fail once flux has exhausted all retries against the vault HelmRelease. Note that the pipeline's elapsed time is 48m instead of 65m and that the failing pods are preserved.

Linked Issue

big-bang/pipeline-templates/pipeline-templates#371 (closed)

Depends on: big-bang/pipeline-templates/pipeline-templates!604 (closed)

Upgrade Notices

N/A

Edited 10 months ago by Noah Birrer

Activity

Noah Birrer assigned to @noahbirrer 10 months ago

Noah Birrer marked this merge request as draft 10 months ago

Noah Birrer added 19 commits 10 months ago

4b47e0b1...8bcd0521 - 18 commits from branch master

708c4171 - Merge branch 'master' into fix/failed-release-pod-logs

Noah Birrer added 1 commit 10 months ago

0634f2a1 - add back interval [ci skip]

Noah Birrer mentioned in merge request big-bang/pipeline-templates/pipeline-templates!604 (closed) 10 months ago

Noah Birrer changed the description 10 months ago

Noah Birrer marked this merge request as ready 10 months ago

Noah Birrer added statusreview label 10 months ago

Noah Birrer added kindenhancement label and removed statusreview label 10 months ago

Noah Birrer added statusreview label 10 months ago

Noah Birrer added kindci label and removed kindenhancement label 10 months ago

Noah Birrer changed the description 10 months ago

Noah Birrer added 5 commits 10 months ago

0634f2a1...583600ee - 4 commits from branch master

38c2c39b - Merge branch 'master' into fix/failed-release-pod-logs

Christopher O'Connell requested review from @chris.oconnell 10 months ago

Christopher O'Connell @chris.oconnell · 10 months ago

Owner

Resolved 10 months ago by Michael Martin

So all of this makes sense to me. Sounds like you've put a lot of thought into it.

Any chance any of our existing HelmReleases take longer than 10 minutes to install/reconcile? I assume that would cause timeouts in this case. Just thinking the 60m timeout was probably added as an override for a reason.

Last reply by Michael Martin 10 months ago
bigbang bot requested review from @ryan.thompson.44, @andrewshoell, and @michaelmartin 10 months ago

Michael Martin resolved all threads 10 months ago

