UNCLASSIFIED - NO CUI


Update HA monitoring architecture doc

Merged Ghost User requested to merge investigate-ha-monitoring into master
@@ -198,20 +198,45 @@ Note: Other packages are responsible for deploying Service Monitors for their co
### HA
High Availability can be accomplished by increasing the number of replicas for the deployments of Alertmanager, Grafana, and Prometheus:
```yaml
monitoring:
  values:
    alertmanager:
      alertmanagerSpec:
        replicas: 3
    prometheus:
      # Expose the Thanos sidecar and scrape it
      thanosService:
        enabled: true
      thanosServiceMonitor:
        enabled: true
      prometheusSpec:
        replicas: 3
    grafana:
      replicas: 3
```
Notes for HA:
- Alertmanager with webhooks to Mattermost
  - network policies should be disabled
  - authorization policies must be deleted
  - a CoreDNS entry was added to allow the webhook to connect
    ```
    NodeHosts:
    172.18.0.2 chat.bigbang.dev
    ```
- Grafana
  - a persistent database must be used or auth tokens are lost as users pass between pods
    - https://grafana.com/docs/grafana/latest/setup-grafana/set-up-for-high-availability/
    - https://grafana.com/docs/grafana/v9.0/setup-grafana/configure-grafana/#database
  - no other issues observed
- Prometheus
  - initial testing indicates no issues for HA when `thanos` is enabled
  - a sub-chart for `thanos` needs to be added to monitoring
  - `thanos` can currently be pulled in as documented [here](https://repo1.dso.mil/platform-one/big-bang/apps/sandbox/thanos/-/tree/test/chart)
  - taken from [VMware](https://docs.vmware.com/en/VMware-Application-Catalog/services/apps/GUID-apps-thanos-administration-enable-metrics.html)
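
The Grafana note above (a persistent database shared by all replicas) could look roughly like the following values fragment. This is a minimal sketch, not tested configuration: it assumes the Grafana chart's `grafana.ini` passthrough is reachable under `monitoring.values.grafana`, and the PostgreSQL host, database name, and user shown are hypothetical placeholders.

```yaml
monitoring:
  values:
    grafana:
      replicas: 3
      grafana.ini:
        database:
          # Assumption: an external PostgreSQL instance shared by all Grafana pods.
          type: postgres
          # Hypothetical service address; replace with the real database endpoint.
          host: grafana-db.example.svc.cluster.local:5432
          # Hypothetical database name and user.
          name: grafana
          user: grafana
          # Do not put the password in plain values; supply it from a secret,
          # for example through the GF_DATABASE_PASSWORD environment variable.
```

With a shared database, sessions and auth tokens are stored outside the individual pods, so a user routed to a different replica stays logged in.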
### Dependency Packages