bug: Flux reliability bug caused by OOM Errors
Bug
Description
During the Sept 28 - Oct 1st BB Engineering cohort lead by @cmcgrath and @sbhatia we had a group use the flux install method here:
Describe the problem, what were you doing when you noticed the bug? During the cohort we noticed lots of unexpected bugs and flux reconcile not working correctly. Ex dev value says reconcile will happen in 1 min, after enabling argocd argocd would take 10 minutes to properly be deployed. (This was because flux had died and it took a while for flux to revive b4 it could reconcile.)
We eventually got lucky and the k describe kustomization bigbang -n=bigbang
surfaced a rare error message mentioning OOM. (Out of Memory).
We then ran k get pods -n=flux-system
and saw the kustomization controller had 17 reboots and log messages meant OOM.
We need to up flux's resource request and limit. Probably the limit (I suspect flux's memory usage hit the resource limit and the pod killed itself due to OOM, as it definitely wasn't a node OOM error.)
Steps to reproduce = deploy flux using the script and use bigbang.
BigBang Version
What version of BigBang were you running? 1.15.3 (but this is irrelevant to this issue)