UNCLASSIFIED - NO CUI

Add resiliency to auto unseal job

During testing and deployment I noticed that the Vault auto unseal job didn't always initialize correctly and didn't always write the secret. I added some resiliency to the ConfigMap and Job that could benefit others.

ConfigMap edits:

init.sh: |-
  KEYS_FOLDER="/vault/data"
  METRICS_POLICY_NAME="prometheus-metrics"
  METRICS_ROLE_NAME="prometheus"
  MONITORING_SERVICE_ACCOUNT_NAME="monitoring-monitoring-kube-prometheus"
  MONITORING_NAMESPACE="monitoring"
  INIT_OUT=/export/init.out
  export VAULT_ADDR=https://vault.{{ domain }}
  until curl -L -s -k -f $VAULT_ADDR/v1/sys/seal-status | grep 'initialized' >& /dev/null; do
    echo "---=== Waiting For Vault Server ===---"; 
    sleep 5; 
  done
  echo "---=== Initializing Vault ===---"
  until vault operator init -address=$VAULT_ADDR > $INIT_OUT; do
    echo "retry initialize"
    sleep 5;
  done
  export VAULT_TOKEN=$(grep Token $INIT_OUT | cut -d' ' -f  4)
  echo "---=== VAULT_TOKEN written to /export/key ===---"
  echo $VAULT_TOKEN > /export/key
  # Use unseal keys 2-4; three keys satisfy the unseal threshold (t=3 of n=5)
  MIN_MASTER_KEYS=$(grep -e "2:\|3:\|4:" $INIT_OUT | awk '{print $4}')
  KEY_NUMBER=1
  for key in $MIN_MASTER_KEYS
  do
      echo '{"key": "'"$key"'"}' > "$KEYS_FOLDER/master_keys_$KEY_NUMBER.json"
      curl --request PUT --data @"$KEYS_FOLDER/master_keys_$KEY_NUMBER.json" "$VAULT_ADDR/v1/sys/unseal"
      KEY_NUMBER=$(( $KEY_NUMBER + 1 ))
  done
  echo "---=== Logging in ===---"
  until vault login -no-store $VAULT_TOKEN >& /dev/null; do
    echo "Waiting to login to vault"; 
    sleep 5; 
  done
  echo "---=== Login Success ===---"
  echo "---=== Enabling Kubernetes ===---"
  until vault auth enable kubernetes >& /dev/null; do
    echo "retry kubernetes enable"; 
    sleep 5; 
  done
  echo "---=== Configuring Kubernetes ===---"
  until vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443" \
    token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    issuer="https://kubernetes.default.svc.cluster.local"; do
    echo "retry kubernetes config";
    sleep 5;
  done
  echo "---=== Writing $METRICS_POLICY_NAME Policy ===---"
  until vault policy write $METRICS_POLICY_NAME - << EOF
  path "/sys/metrics" { 
  capabilities = ["read"]
  }
  EOF
  do
    echo "retry policy write";
    sleep 5;
  done
  echo "---=== Reading $METRICS_POLICY_NAME Policy ===---"
  until vault policy read $METRICS_POLICY_NAME; do
    echo "retry read";
    sleep 5;
  done
  echo "---=== Writing $METRICS_POLICY_NAME Auth ===---"
  until vault write auth/kubernetes/role/$METRICS_ROLE_NAME \
    bound_service_account_names=$MONITORING_SERVICE_ACCOUNT_NAME \
    bound_service_account_namespaces=$MONITORING_NAMESPACE \
    policies=$METRICS_POLICY_NAME ttl=15m; do
    echo "retry write auth";
    sleep 5;
  done
  exit 0

Basically, this wraps every command in an until loop so that it is retried until it succeeds. I also changed the readiness check URL to the seal-status endpoint, which is cleaner.
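
One possible refinement, not part of this change: the until loops retry forever, so a permanently failing command would hang the job. A minimal sketch of a bounded retry helper follows. The names retry, MAX_ATTEMPTS and RETRY_DELAY are hypothetical and do not exist in the chart.

```shell
#!/bin/bash
# retry: run a command until it succeeds or the attempt limit is reached.
# Hypothetical helper; MAX_ATTEMPTS and RETRY_DELAY are assumed names.
retry() {
  local max_attempts="${MAX_ATTEMPTS:-60}"
  local delay="${RETRY_DELAY:-5}"
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry ($attempt/$max_attempts): $*"
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

With this in init.sh, a step could be written as `retry vault auth enable kubernetes`, and a non-zero return could fail the job instead of looping indefinitely.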

Job edits:

containers:
- name: bigbang-base-secret-creation
  command:
    - /bin/bash
    - -c
    - |
      echo "---=== Writing secret ===---"
      until kubectl create secret generic vault-token --from-file=key=/export/key --from-file=init.out=/export/init.out >& /dev/null; do
        echo "Retry writing secret"
        sleep 5;
      done
      echo "Killing Istio Sidecar"
      curl -X POST http://localhost:15020/quitquitquit
      exit 0

This switches from a fixed sleep to an until loop, which ensures the secret actually gets created by retrying until it does.
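
One caveat with the loop: if the job is re-run and the vault-token secret already exists, kubectl create fails every time and the loop never exits. A hedged sketch of one way to handle that is below. The function name create_secret_once and the KUBECTL override are hypothetical, and the match on "already exists" assumes kubectl's usual error wording.

```shell
#!/bin/bash
# create_secret_once: retry secret creation, but stop if the secret is already there.
# Hypothetical helper; KUBECTL can be overridden (e.g. for testing), default is kubectl.
create_secret_once() {
  local out
  until out=$("${KUBECTL:-kubectl}" create secret generic vault-token \
      --from-file=key=/export/key --from-file=init.out=/export/init.out 2>&1); do
    case "$out" in
      *AlreadyExists*|*"already exists"*)
        # Assumed error text; treat an existing secret as success.
        echo "Secret already exists, nothing to do"
        return 0
        ;;
    esac
    echo "Retry writing secret"
    sleep "${RETRY_DELAY:-5}"
  done
}
```

In the job, the until block would then be replaced by a single call to `create_secret_once` before the Istio sidecar is stopped.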

Here is an example of it running and re-running some steps due to connection issues:

$ kubectl logs -n vault pod/vault-vault-job-init-jht4z -c vault-init-job -f
---=== Waiting For Vault Server ===---
---=== Waiting For Vault Server ===---
---=== Initializing Vault ===---
---=== VAULT_TOKEN written to /export/key ===---
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    95  100    40  100    55    189    260 --:--:-- --:--:-- --:--:--   450
{"errors":["Vault is not initialized"]}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to vault.{{ domain }}:443 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   351  100   296  100    55    127     23  0:00:02  0:00:02 --:--:--   151
{"type":"shamir","initialized":true,"sealed":false,"t":3,"n":5,"progress":0,"nonce":"","version":"1.13.1","build_date":"2023-03-23T12:51:35Z","migration":false,"cluster_name":"vault-cluster-cabb745d","cluster_id":"6e16cc5b-cc7a-74c9-4f39-0f3814366d45","recovery_seal":true,"storage_type":"raft"}
---=== Logging in ===---
---=== Login Success ===---
---=== Enabling Kubernetes ===---
retry kubernetes enable
---=== Configuring Kubernetes ===---
Success! Data written to: auth/kubernetes/config
---=== Writing prometheus-metrics Policy ===---
Success! Uploaded policy: prometheus-metrics
---=== Reading prometheus-metrics Policy ===---
Error reading policy named prometheus-metrics: Error making API request.

URL: GET https://vault.{{ domain }}/v1/sys/policies/acl/prometheus-metrics
Code: 503. Errors:

* Vault is sealed
retry read
path "/sys/metrics" { 
capabilities = ["read"]
}
---=== Writing prometheus-metrics Auth ===---
Success! Data written to: auth/kubernetes/role/prometheus

We should also add an optional network policy to monitoring, applied only when Vault is enabled, that allows egress on port 443 to the Vault URL. Otherwise monitoring won't work.
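
As a rough illustration of such a policy, the sketch below prints a NetworkPolicy manifest that allows TCP 443 egress from Prometheus pods. Everything specific here is an assumption: the function name render_vault_egress_policy, the policy name allow-vault-egress, and the pod label app.kubernetes.io/name: prometheus. It does not restrict the destination, since a NetworkPolicy cannot match on a URL; a real policy would likely narrow this with a namespaceSelector or ipBlock.

```shell
#!/bin/bash
# render_vault_egress_policy: print a NetworkPolicy allowing 443 egress.
# Hypothetical sketch; the policy name and pod label are assumptions.
render_vault_egress_policy() {
  local namespace="${1:-monitoring}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-vault-egress
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: TCP
          port: 443
EOF
}
```

It could be tried out with `render_vault_egress_policy monitoring | kubectl apply -f -`. In the chart itself this would more naturally be a template that is rendered only when Vault is enabled.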