UX / Security Improvement: Change Flux config from 3 retries to infinite retries
Summary of change:
Update Flux config from 3 retries to -1 retries (infinite).
Difficulty/Size of change: Trivial/Tiny Config update
Roughly where in the codebase the change would need to occur:
https://repo1.dso.mil/platform-one/big-bang/bigbang/-/blob/master/scripts/install_flux.sh
https://repo1.dso.mil/platform-one/big-bang/bigbang/-/tree/master/base
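As a rough sketch of what the change might look like, assuming the retry count lives in the `remediation` settings of a Flux HelmRelease (where a negative value means unlimited retries). The resource name, apiVersion, and which manifest under `base` actually carries these fields are assumptions and should be checked against the repo; only the relevant fields are shown, so this is a fragment rather than a complete manifest.

```yaml
# Fragment only: chart, interval, and values are omitted.
# The name and apiVersion are assumptions for illustration.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bigbang
  namespace: bigbang
spec:
  install:
    remediation:
      retries: -1   # was 3; a negative value means retry indefinitely
  upgrade:
    remediation:
      retries: -1   # was 3; a negative value means retry indefinitely
```

If `install_flux.sh` or the chart templates set the retry count in another way, the equivalent value would need to change there instead.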
Why this change should be implemented:
1. Improves User Experience:
Although GitOps and Kubernetes are declarative, customers have to work out imperatively, by trial and error, what the correct declarative configuration is. If Flux is set to retry 3, 10, or even 100 times, a user who is still iterating on the configuration will often push a bad config and step away (go to sleep, for example), and by the time they return Flux has exhausted its retries and given up. The user then has to run several commands to tell Flux to try again:
flux suspend source git $1 -n bigbang
flux resume source git $1 -n bigbang
flux suspend hr $1 -n bigbang
flux resume hr $1 -n bigbang
If we set infinite retries, Flux will never give up, so those imperative commands will never be needed to get it unstuck, which results in a better User Experience.
2. Improves Security Posture:
A big theoretical benefit of GitOps is an improved security posture: Ops team members are required to make changes through GitOps and nobody needs direct kubectl access, which prevents out-of-band changes, supports the principle of least privilege, and forces an audit trail of changes.
Suppose a user had a bad config in the git repo that caused Flux to exhaust its 3 retries and give up. When the user then fixes the declarative config in the git repo, nothing happens, because Flux has already given up. The user must imperatively run the above commands to get declarative operations working again (which is an anti-pattern IMO). This means the user needs direct kubectl access to the cluster just to get Flux unstuck.
If we set infinite retries, Flux would never give up, so a user could fix the config with access to only the git repo, in proper GitOps fashion. Not needing direct kubectl access improves security posture.
3. Weak Anecdotal Reasoning:
Infinite Retries is ArgoCD's default. (ArgoCD is a mature GitOps tool.)