There are currently OOMKilled issues with pod/fortify-ssc-webapp-0 that sometimes result in a failed clean install job. We must ensure that the clean install job passes consistently by appropriately increasing the resource limits in the test values used in both the Package and Big Bang Pipelines. These will most likely need to match the default resource limits set in the package values.yaml.
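For reference, here's a quick way to confirm the OOMKill and see what requests/limits the pod is actually running with (the `fortify` namespace is an assumption on my part; adjust to whatever namespace the install job uses):

```shell
# Check the container's last termination state -- should show Reason: OOMKilled
kubectl -n fortify describe pod fortify-ssc-webapp-0 | grep -A 7 'Last State'

# Show the requests/limits currently applied to the pod's containers
kubectl -n fortify get pod fortify-ssc-webapp-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
```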
Just wanted to add some clarification here, as I feel like this could go in the wrong direction very easily. The Helm tests (Cypress included) should not be firing while infrastructure/services required for the actual package to be considered healthy and ready for testing are still coming up. In the past, we've had to go through and remove all of the unnecessary waits and timeouts from Cypress tests, as they were greatly increasing the time it took for tests to complete and often still ended up failing.
The wait.sh script is the intended solution for this sort of race condition, if that is what's going on. It is executed as part of the Package_Wait step (prior to executing tests) and ensures that all of the required items are in place (i.e. any additional k8s services, resources, etc.). The reason is that a Helm chart can be reported as successfully installed even though it's not 100% fully operational just yet.
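As a rough illustration only (not the actual package template), the kind of check a wait.sh could do here might look like this; the namespace and resource names are assumptions based on the pod name above:

```shell
#!/bin/bash
set -e

# Wait for the SSC webapp statefulset to have all replicas ready before tests run.
kubectl -n fortify rollout status statefulset/fortify-ssc-webapp --timeout=600s

# Also confirm the pod reports Ready, since "installed" != "fully operational".
kubectl -n fortify wait pod/fortify-ssc-webapp-0 --for=condition=Ready --timeout=600s
```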
If you check the logs under the artifacts for the last failed install job, you'll see that the Fortify web app crashed:
I see the Fortify web app crashing. The logs don't say much. It looks like the app server starts properly but doesn't receive any requests, then suddenly fails.
I don't even see Fortify on the list of apps we observed, so I'm looking into that... Looking at the screenshot, I can see that the QoS is Guaranteed, so it's not being evicted for higher-priority workloads. I'm going to fire up Grafana and take a look...
I currently only have access to the DogFood cluster, which isn't running Fortify. That explains why we missed it.
However, as I think we all know with OOMKilled, either bump the memory limits, or, if it keeps happening after that, there's likely a memory leak that needs to be addressed. The node could also be over-committed, in which case adding another node could resolve the issue. Ideally, we can eventually get the other workloads right-sized so we won't have to "add a node".
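If it's useful, a couple of standard checks for over-commitment (kubectl top needs metrics-server, which may or may not be on this cluster):

```shell
# See how much of each node's allocatable CPU/memory is already requested/limited
kubectl describe nodes | grep -A 8 'Allocated resources'

# Actual usage per node and for the Fortify pods (requires metrics-server)
kubectl top nodes
kubectl top pods -n fortify
```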
This should be a sufficient memory limit compared to the 1Gi currently given in the test values. There are no OOM issues when testing on k3d-dev, which uses the above defaults.
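If it helps, something like this can be used to compare the chart defaults against what the pipelines actually install with (the paths and the `.resources` key are assumptions about this repo's layout; assumes yq v4 is available):

```shell
# Default resources shipped in the package values.yaml
yq '.resources' chart/values.yaml

# Resources the Package/Big Bang pipelines install with
yq '.resources' tests/test-values.yaml
```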