[P1BIGROCKS-2785] Add Horizontal or Vertical Pod Autoscaling on all long-living pods in Big Bang
Problem Statement:
In the TOC, a customer raised the concern that they cannot tell whether Big Bang's default CPU/memory settings are sufficient for their production load. Each customer must go through a "tuning" process to identify under- or over-allocated resources. Under-allocated resources can cause slowness or service failure. Over-allocated resources can cause scheduling problems or the additional cost of running unnecessary nodes in the cluster.
In BB, if the default requests/limits are too large, smaller clusters need to spin up larger, costlier nodes to accommodate them. If the requests/limits are too small, larger customers do not have sufficient resources and are required to override the settings. Big Bang chose to balance cost vs. quality by evaluating a development cluster and adding margin. However, this does not meet all customer needs.
Big Bang could create default limits/requests for small, medium, large, and x-large cluster sizes. However, this would be a very time-consuming and subjective effort that would likely still fall short of meeting customer needs for right-sizing.
Proposed Solutions:
- Wherever possible, implement Horizontal Pod Autoscaling (HPA) on pods. This allows the number of pod replicas to scale up/down based on dynamic load. This is preferable to vertical autoscaling because it provides high availability and eases scheduling. HPA is already used in several areas of BB. If a pod supports HPA, it should be configured to use it.
- If HPA is not available, implement Vertical Pod Autoscaling (VPA). This would allow a pod's CPU/memory requests and limits to be scaled up/down dynamically based on load. Rolling updates would be required to avoid downtime.
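As a rough illustration of the HPA option, a minimal manifest might look like the sketch below. The target name `example-app`, the namespace, the replica bounds, and the 80% CPU target are placeholders, not Big Bang defaults; the minimum of two replicas is there only to show an HA-friendly floor.

```yaml
# Sketch only: names and thresholds are hypothetical, not Big Bang defaults.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
  namespace: example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2          # floor of two replicas for availability
  maxReplicas: 5          # assumed ceiling; tune per package
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # assumed target, percent of CPU request
```

Utilization targets are computed against the container's CPU request, so the pod still needs a request set for this kind of HPA to work.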
Acceptance:
- For each package, identify whether HPA is supported on the pod controllers. If it is, make sure it is configured to use it by default. If the application is critical, the minimum should be 2 or more replicas to support HA.
- For each package, identify all pod controllers that don't support replicas. Configure rolling updates with maxSurge = 1 and maxUnavailable = 0. Add a vertical pod autoscaler resource to control each of these pod controllers.
- Add a Helm chart for deploying the vertical pod autoscaler with Big Bang
  - Containers must be sent through Iron Bank
  - Must support different modes for scaling (auto, off, etc.)
- In the monitoring package, enable the vertical pod autoscaler collector in kube-state-metrics.
- Add dashboard for vertical pod autoscaler (e.g. https://grafana.com/grafana/dashboards/14588)
- Documentation on how to right-size your pods (e.g. no right-sizing is needed if HPA/VPA is enabled; if VPA is off, look at the dashboard to identify the changes required)
Note: Do not use both HPA and VPA on the same resource.
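For a pod controller that cannot run multiple replicas, the rolling update and VPA settings from the acceptance criteria could be sketched as follows. The workload name, namespace, image, and the `minAllowed`/`maxAllowed` bounds are hypothetical and only illustrate where each setting lives.

```yaml
# Sketch only: workload name, image, and resource bounds are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: example
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # start the replacement pod first
      maxUnavailable: 0   # never drop below the desired replica count
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app
  namespace: example
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Auto"    # other modes: "Off", "Initial", "Recreate"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "1"
          memory: 1Gi
```

The upstream VPA updater applies new requests by evicting pods, so it is worth verifying how that interacts with the rolling update settings for single-replica workloads.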
Open Questions:
- Is there a downside to allowing VPA to auto-update? Should we run in "Off" mode to just get recommendations in dashboards and expect customers to update manually?
- Will Flux reset the values, or will the VPA mutate them on the way in?
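To make the "Off" mode question concrete, the sketch below shows a recommendation-only VPA along with the shape of the recommendation it reports in its status. The object name and all values under `status` are illustrative, not measured data.

```yaml
# Sketch only: name is hypothetical; status values are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app
  namespace: example
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Off"     # compute recommendations, never change pods
status:                   # populated by the VPA recommender, not by the user
  recommendation:
    containerRecommendations:
      - containerName: example-app
        lowerBound:
          cpu: 25m
          memory: 128Mi
        target:
          cpu: 100m
          memory: 256Mi
        upperBound:
          cpu: 500m
          memory: 512Mi
```

In this mode, customers would copy the `target` values into their Big Bang value overrides by hand.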