bug: Flux reliability bug caused by OOM Errors

added kindbug label

added flux label

added priority4 teamcore/security labels

changed milestone to %1.18.0

changed iteration to Big Bang Iterations Sep 21, 2021 - Oct 4, 2021

I ran into another symptom of this problem here: https://repo1.dso.mil/platform-one/onboarding/big-bang/engineering-cohort/-/blob/master/lab_guides/10-customizing-and-extending-big-bang/E-use-gitops-to-deploy-argocd-app.md#useful-thing-1-using-gitops-crds-can-auto-recover-from-config-drift-which-could-be-used-as-a-security-control

When I ran the second kubectl get deployment -n=nginx, even after 3-4 minutes, there were still 0/2 deployments ready. When I ran kubectl get application -n=argocd either time, the application didn't come back like it was supposed to. When I ran kubectl get kustomization -A, the output included this line:

    bigbang deploy-apps False apply failed: signal: killed, kubectl process was killed, probably due to OOM 10m
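To spot this symptom quickly, the kustomization listing can be filtered for rows that mention OOM. This is a minimal sketch: the helper name find_oom_kustomizations is hypothetical, and it assumes the first two columns are namespace and name, as in the line quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get kustomization -A` output on stdin
# and prints "namespace/name" for every row whose status mentions OOM.
# Assumes columns 1 and 2 are NAMESPACE and NAME.
find_oom_kustomizations() {
  awk '/OOM/ { print $1 "/" $2 }'
}

# Demo using the line quoted in this thread, so no cluster is needed.
printf '%s\n' \
  'bigbang deploy-apps False apply failed: signal: killed, kubectl process was killed, probably due to OOM 10m' \
  | find_oom_kustomizations
```

Against a live cluster the same filter would be used as: kubectl get kustomization -A | find_oom_kustomizations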

I could be mistaken, but if I'm reading the guide correctly, you actually ran into a different issue: that nginx application would be controlled by ArgoCD, so that would've been an OOM on ArgoCD rather than Flux.

Thanks for keeping this high priority.
It's 100% a Flux issue (specifically the kustomize-controller) and not an ArgoCD issue. Here's some background context:

The cohort lab being referenced bundles nginx YAML manifests via an ArgoCD Application CR, then uses Kustomize to deploy that Application CR. (This is overly complex on purpose, for learning's sake: if they can learn the complex case, they've learned it.)

The OOM log message came from the kustomize-controller pod in the flux-system namespace.

Gotcha. I wouldn't be surprised if you guys also run into Argo OOM limits in some cases. Our testing of the defaults is based on one small app being deployed, and I know some other customers have increased the limits for their usage.

set weight to 2

assigned to @micah.nagel

added statusdoing label

added priority5 label and removed priority4 label

Priority 2 instead of 1 just because we haven't seen this complaint too much, but since it's Flux, keeping it high priority. Working on it in this sprint regardless.

Upstream limits are set to 1G Mem / 1 CPU for each controller. That's certainly overkill, and we wouldn't want that as our default since limits = requests. I'm investigating whether I can reproduce an OOM kill and find a good new limit, but barring that I'd advise we bump to 200m CPU / 400Mi Mem. That would double the current limits but still keep them relatively reasonable (especially compared to the 1/1 default from upstream).

Definitely want to gather data first so we can make an informed choice. Interestingly, our minimum hardware reqs say to use 1 GB min, 1.5 GB recommended for all Flux components - https://repo1.dso.mil/platform-one/big-bang/bigbang/-/blob/master/docs/guides/prerequisites/minimum_hardware_requirements.xlsx - which seems very high considering the helm controller (which should be the heavy hitter) does not even have that much in our current config.

This is what it's set to today:

    k get po kustomize-controller-666dfd7f97-vlgnd -n=flux-system -o yaml | grep resources: -A 6

    resources:
      limits:
        cpu: 100m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi

This spike went to 483Mi before the OOM kill.
The yellow dashed line is the 200Mi memory limit, which caused repeated restarts.

400Mi won't work. We'd need to bump to at least 500Mi, but I'd recommend we give it a smidge more breathing room given how critical Flux is, and consider putting it at 600Mi.
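To make the comparison concrete, here is a rough sketch of the headroom each candidate limit leaves over the 483Mi peak reported above. The function name headroom_pct is hypothetical, and headroom is defined here as a whole percent of the peak (integer division), which is just one reasonable way to measure it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: headroom of a memory limit over an observed peak,
# as a whole percent of the peak. Negative means the limit is below the peak.
headroom_pct() {
  local limit_mi="$1" peak_mi="$2"
  echo $(( (limit_mi - peak_mi) * 100 / peak_mi ))
}

peak_mi=483  # peak observed in Grafana, per this thread
for limit_mi in 200 400 500 600; do
  echo "${limit_mi}Mi limit: $(headroom_pct "$limit_mi" "$peak_mi")% headroom over ${peak_mi}Mi peak"
done
```

With these numbers, 200Mi and 400Mi both come out negative, 500Mi leaves only a few percent, and 600Mi leaves roughly a quarter of the peak as breathing room.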

It's also worth noting that Grafana showed the CPU values were fine as is.
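For anyone who wants to try the proposed value on a test cluster before a chart change lands, one option is kubectl set resources. The sketch below only prints the command so it can be reviewed first. The helper name is hypothetical, and it assumes the controller is a Deployment named kustomize-controller in flux-system (as the pod name earlier in this thread suggests). It is not necessarily how the merged fix was implemented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) a kubectl command that sets
# requests == limits on a Flux controller Deployment in flux-system.
print_set_resources_cmd() {
  local deployment="$1" cpu="$2" memory="$3"
  printf 'kubectl -n flux-system set resources deployment/%s --requests=cpu=%s,memory=%s --limits=cpu=%s,memory=%s\n' \
    "$deployment" "$cpu" "$memory" "$cpu" "$memory"
}

# CPU stays at the current 100m; memory goes to the suggested 600Mi.
print_set_resources_cmd kustomize-controller 100m 600Mi
```

A manual change like this would likely be overwritten the next time Flux is reinstalled or upgraded, so it is only useful for confirming that the new limit stops the restarts.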

mentioned in merge request !951 (merged)

added statusreview label and removed statusdoing label

closed with merge request !951 (merged)

mentioned in commit 7d1cb19c

removed statusreview label
