UNCLASSIFIED - NO CUI

Default Prometheus Configuration allows WAL to grow indefinitely

Summary

With the default storage settings for .Values.prometheus.prometheusSpec, IronBank noticed that the Prometheus write-ahead log (WAL) had grown to fill the entire 30Gi volume group that the host node uses for ephemeral pod storage (/var/lib/kubelet/pods/).
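
How large the WAL has grown can be confirmed from inside the pod. A minimal sketch, assuming the default container name prometheus used by the prometheus-operator:

kubectl -n monitoring exec prometheus-monitoring-monitoring-kube-prometheus-0 \
  -c prometheus -- du -sh /prometheus/wal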

Symptoms

The following logs were collected from the prometheus-monitoring-monitoring-kube-prometheus-0 pod in the monitoring namespace:

level=warn ts=2021-10-20T15:02:19.491Z caller=manager.go:619 component="rule manager" group=kubernetes-resources msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000714: no space left on device"
level=warn ts=2021-10-20T15:02:19.542Z caller=manager.go:619 component="rule manager" group=istio.metricsAggregation-rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000714: no space left on device"
level=warn ts=2021-10-20T15:02:19.549Z caller=manager.go:619 component="rule manager" group=istio.metricsAggregation-rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000714: no space left on device"
level=warn ts=2021-10-20T15:02:19.556Z caller=manager.go:619 component="rule manager" group=istio.metricsAggregation-rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00000714: no space left on device"
level=error ts=2021-10-20T15:02:19.619Z caller=scrape.go:1088 component="scrape manager" scrape_pool=monitoring/monitoring-monitoring-kube-istio-envoy/0 target=http://192.168.54.31:15020/stats/prometheus msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00000714: no space left on device"
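
These can be pulled with kubectl logs; a sketch, again assuming the container is named prometheus:

kubectl -n monitoring logs prometheus-monitoring-monitoring-kube-prometheus-0 \
  -c prometheus | grep "no space left on device"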

Exec'ing into the prometheus pod to confirm:

bash-4.4$ df -Th /prometheus/
Filesystem                    Type  Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-varVol ext4   30G   29G     0 100% /prometheus

Here /prometheus/ inside the pod is backed by /var/lib/kubelet/pods on the host machine.

These failures resulted in a non-responsive Prometheus pod, which in turn caused HPAs to fail to gather metrics from the v1beta1.metrics.k8s.io APIService.
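
The autoscaling impact shows up in a quick check; a hedged sketch, not part of the original triage:

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get hpa -A   # TARGETS column reads <unknown> while metrics are unavailable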

Temporary Solution

In the short term, restarting the pod clears the WAL (the data lives on ephemeral storage and is discarded with the pod), allowing HPAs to gather metrics and scale ReplicaSets again.
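
A sketch of the restart; the StatefulSet controller recreates the pod, and because the storage is ephemeral the new pod starts with an empty WAL:

kubectl -n monitoring delete pod prometheus-monitoring-monitoring-kube-prometheus-0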

In the medium term, we added the following settings to keep Prometheus from sitting in a persistently bad state:

monitoring:
  values:
    prometheus:
      prometheusSpec:
        nodeSelector:
          ironbank: prometheus
        tolerations:
          - key: ironbank
            effect: NoSchedule
            value: prometheus
        walCompression: true # https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusspec
        resources:
          limits:
            ephemeral-storage: 20Gi # So we die if we approach our on-host storage limit
            memory: null
            cpu: null
          requests:
            cpu: 300m
            memory: 5Gi
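
The ephemeral-storage limit leans on kubelet eviction: if the Prometheus container exceeds 20Gi of ephemeral storage, the kubelet evicts the pod and the StatefulSet recreates it with an empty WAL, so we fail fast instead of filling the shared 30Gi volume. To confirm the limits landed on the running pod (a sketch, assuming the container name prometheus):

kubectl -n monitoring get pod prometheus-monitoring-monitoring-kube-prometheus-0 \
  -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources}'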

In the long term, we should understand the particular requirements of our Prometheus deployment and set its configuration accordingly.
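
As a hedged sketch of what that might look like (the retention values, storage size, and storageClassName are placeholders, not recommendations), the relevant prometheusSpec settings would be along these lines:

monitoring:
  values:
    prometheus:
      prometheusSpec:
        retention: 7d           # time-based retention sized to actual query needs
        retentionSize: 15GB     # size-based retention kept below the volume capacity
        storageSpec:            # move TSDB/WAL off ephemeral storage onto a dedicated PVC
          volumeClaimTemplate:
            spec:
              storageClassName: gp2   # placeholder; use the cluster's storage class
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 20Gi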

Why am I doing this?

General awareness that the default configuration for the monitoring package doesn't prevent it from consuming all of a node's ephemeral pod storage, i.e. becoming a noisy neighbor.