UNCLASSIFIED - NO CUI

Prevent Failed Jobs From Queuing Up

General MR

Summary

Failed jobs need to be cleaned up when a job errors out early, for example on initContainer errors or image pull errors. ttlSecondsAfterFinished doesn't cover these cases, since the job has neither started nor completed.

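Set "concurrencyPolicy: Forbid"

With concurrencyPolicy: Forbid, the cronjob skips a new run while the previous job is still active, so jobs no longer pile up when the prior job is stuck on an image pull error or something similar.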
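
As a rough illustration of where the setting sits, the sketch below prints a minimal CronJob manifest from a shell function. The function name, job name, and image are placeholders invented for this example and are not bbctl's real values. The 2-minute schedule and the backoffLimit of 6 mirror the test described under "Relevant logs/screenshots".

```shell
# Print a minimal CronJob manifest that forbids concurrent runs.
# ASSUMPTION: name, image, and container spec are hypothetical placeholders,
# not taken from the bbctl chart.
cronjob_manifest() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-job
  namespace: bbctl
spec:
  schedule: "*/2 * * * *"
  # Skip a scheduled run if the previous job is still active.
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 6
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: example
              image: registry.example.com/example:latest
EOF
}
```

One way to sanity-check the output without creating anything is to pipe it through a client-side dry run: cronjob_manifest | kubectl apply --dry-run=client -f -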

Relevant logs/screenshots

Before fix, running the jobs every 2 mins, after ~8 mins:

image

After fix, running the jobs every 2 mins, after ~8 mins: we max out at 6 pods (.spec.backoffLimit) for the initContainer errors, and at 1 pod for the ImagePullBackOff:

image

Linked Issue

n/a

Upgrade Notices

By default, bbctl runs a handful of cronjobs. In prior releases, these jobs could stack up in the event of a container startup error. This release prevents multiple jobs from running concurrently by setting spec.concurrencyPolicy: Forbid. This limits how many pods in an error state can accumulate per job.
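
To confirm after upgrading that the policy is in place, something like the following sketch could be used. The function name is made up for this example. It relies only on kubectl's standard custom-columns output and defaults to the bbctl namespace.

```shell
# List each cronjob in a namespace with its schedule and concurrency policy.
# ASSUMPTION: show_concurrency_policy is a hypothetical helper, not part of bbctl.
# Usage: show_concurrency_policy [namespace]   (namespace defaults to bbctl)
show_concurrency_policy() {
  kubectl get cronjobs -n "${1:-bbctl}" \
    -o custom-columns='NAME:.metadata.name,SCHEDULE:.spec.schedule,POLICY:.spec.concurrencyPolicy'
}
```

After the upgrade, each cronjob should be listed with Forbid in the POLICY column.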

Old, unfinished jobs and their pods can be removed by running this command:

$ kubectl -n bbctl delete jobs --all

Jobs will then be re-run at the next scheduled interval, which can be viewed with:

$ kubectl get cronjobs -n bbctl
Edited by Michael Martin