Prevent Failed Jobs From Queuing Up
General MR
Summary
When a job fails early (for example, on an initContainer error or an image pull error), its failed jobs need to be cleaned up. ttlSecondsAfterFinished doesn't help in these cases, since the job has neither started nor completed.
Set concurrencyPolicy: Forbid
Setting concurrencyPolicy: Forbid prevents a new job from being scheduled while a prior job is still stuck on an ImagePullBackOff or a similar startup error.
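As a sketch, a CronJob spec with this policy might look like the following. The name, schedule, and image here are illustrative placeholders, not the actual bbctl chart values:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: bbctl-example          # hypothetical name
  namespace: bbctl
spec:
  schedule: "*/2 * * * *"      # every 2 minutes, matching the test runs below
  concurrencyPolicy: Forbid    # skip a new run while the previous job is still active
  jobTemplate:
    spec:
      backoffLimit: 6          # caps failed-pod retries for initContainer errors
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: bbctl
              image: registry.example.com/bbctl:latest  # hypothetical image
```

With Forbid, the controller skips a scheduled run entirely if the previous job has not finished, so stuck pods stop accumulating across runs.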
Relevant logs/screenshots
Before the fix -- jobs running every 2 minutes, after ~8 minutes:
After the fix -- jobs running every 2 minutes, after ~8 minutes -- initContainer errors max out at 6 pods ( .spec.backoffLimit ), and the ImagePullBackOff maxes out at 1:
Linked Issue
n/a
Upgrade Notices
By default, bbctl runs a handful of cronjobs. In prior releases, these jobs could stack up in the event of a container startup error. This release prevents multiple jobs from running concurrently by setting spec.concurrencyPolicy: Forbid, which limits the number of error-state pods that can accumulate per job.
Old, unfinished jobs and their pods can be removed by running this command:
$ kubectl -n bbctl delete jobs --all
Jobs will then be re-run on the next scheduled interval, which can be viewed with:
$ kubectl get cronjobs -n bbctl