There are currently OOMKilled issues with pod/fortify-ssc-webapp-0 that sometimes result in a failed clean install job. We must ensure that the clean install job passes consistently by appropriately increasing the resource limits in the test values used in both the Package and Big Bang Pipelines. These will most likely need to match the default resource limits set in the package values.yaml.
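For reference, here's a quick way to confirm the OOMKill and see what requests/limits the pod is actually running with (the `fortify` namespace is an assumption on my part; adjust to whatever namespace the install job uses):

```shell
# Check the container's last termination state -- should show Reason: OOMKilled
kubectl -n fortify describe pod fortify-ssc-webapp-0 | grep -A 7 'Last State'

# Show the requests/limits currently applied to the pod's containers
kubectl -n fortify get pod fortify-ssc-webapp-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
```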
Just wanted to add some clarification here, as I feel like this could go in the wrong direction very easily. The Helm tests (Cypress included) should not be firing while infrastructure/services required for the actual package to be considered healthy and ready for testing are still coming up. In the past, we've had to go through and remove all of the unnecessary waits and timeouts from Cypress tests, as they were greatly increasing the time it took for tests to complete and often still ended up failing.
The wait.sh script is the intended solution for this sort of race condition, if that is what's going on. It is executed as part of the Package_Wait step (prior to executing tests) and ensures that all of the required items are in place (i.e. any additional k8s services, resources, etc.). The reason is that a Helm chart can be reported as successfully installed even though it's not 100% fully operational just yet.
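As a rough illustration only (not the actual package template), the kind of check a wait.sh could do here might look like this; the namespace and resource names are assumptions based on the pod name above:

```shell
#!/bin/bash
set -e

# Wait for the SSC webapp statefulset to have all replicas ready before tests run.
kubectl -n fortify rollout status statefulset/fortify-ssc-webapp --timeout=600s

# Also confirm the pod reports Ready, since "installed" != "fully operational".
kubectl -n fortify wait pod/fortify-ssc-webapp-0 --for=condition=Ready --timeout=600s
```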
If you check the logs under the artifacts for the last failed install job, you'll see that the Fortify web app crashed:
I see the Fortify web app crashing. The logs don't say much. It looks like the app server starts properly but doesn't receive any requests, then suddenly fails.
I don't even see Fortify on the list of apps we observed, so I'm looking into that... Looking at the screenshot, I can see that the QoS is Guaranteed, so it's not being evicted for higher-priority workloads. I'm going to fire up Grafana and take a look...
I currently only have access to the DogFood cluster, which isn't running Fortify. That explains why we missed it.
However, as I think we all know with OOMKilled, either bump the memory limits, or, if it keeps happening after that, there's likely a memory leak that needs to be addressed. The node could also be over-committed, in which case adding another node could resolve the issue. Ideally, we can eventually get the other workloads right-sized so we won't have to "add a node".
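If it's useful, a couple of standard checks for over-commitment (kubectl top needs metrics-server, which may or may not be on this cluster):

```shell
# See how much of each node's allocatable CPU/memory is already requested/limited
kubectl describe nodes | grep -A 8 'Allocated resources'

# Actual usage per node and for the Fortify pods (requires metrics-server)
kubectl top nodes
kubectl top pods -n fortify
```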
This should be a sufficient memory limit compared to the 1Gi currently given in the test values. There are no OOM issues when testing on k3d-dev, which uses the above defaults.
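If it helps, something like this can be used to compare the chart defaults against what the pipelines actually install with (the paths and the `.resources` key are assumptions about this repo's layout; assumes yq v4 is available):

```shell
# Default resources shipped in the package values.yaml
yq '.resources' chart/values.yaml

# Resources the Package/Big Bang pipelines install with
yq '.resources' tests/test-values.yaml
```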