Resolve deployment of ESO webhook certificates, CRDs and CRs
TL;DR: Make the Big Bang external-secrets package do this and the pipelines should pass.
When deploying external-secrets, the webhook certificates and/or the CRDs fail to deploy. The webhook and controller pods crash-loop, and the cert-controller pod never becomes ready.
$ kubectl get pods -n external-secrets -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
external-secrets-cert-controller-545f79bb6f-8qfgb 0/1 Running 0 28m 10.42.2.6 k3d-k3s-default-agent-2 <none> <none>
external-secrets-webhook-6b876f4974-mfxtm 0/1 CrashLoopBackOff 7 (104s ago) 28m 10.42.2.8 k3d-k3s-default-agent-2 <none> <none>
external-secrets-7888b6b965-w759b 0/1 CrashLoopBackOff 7 (91s ago) 28m 10.42.2.7 k3d-k3s-default-agent-2 <none> <none>
$ kubectl events -n external-secrets | grep external-secrets-webhook
29m Normal ScalingReplicaSet Deployment/external-secrets-webhook Scaled up replica set external-secrets-webhook-6b876f4974 to 1
29m Normal SuccessfulCreate ReplicaSet/external-secrets-webhook-6b876f4974 Created pod: external-secrets-webhook-6b876f4974-mfxtm
29m Normal Scheduled Pod/external-secrets-webhook-6b876f4974-mfxtm Successfully assigned external-secrets/external-secrets-webhook-6b876f4974-mfxtm to k3d-k3s-default-agent-2
29m Normal Pulling Pod/external-secrets-webhook-6b876f4974-mfxtm Pulling image "registry1.dso.mil/ironbank/opensource/external-secrets/external-secrets:v0.9.15"
29m Normal Pulled Pod/external-secrets-webhook-6b876f4974-mfxtm Successfully pulled image "registry1.dso.mil/ironbank/opensource/external-secrets/external-secrets:v0.9.15" in 1.785s (1.785s including waiting)
29m Normal Created Pod/external-secrets-webhook-6b876f4974-mfxtm Created container webhook
29m Normal Started Pod/external-secrets-webhook-6b876f4974-mfxtm Started container webhook
9m14s (x29 over 25m) Warning BackOff Pod/external-secrets-webhook-6b876f4974-mfxtm Back-off restarting failed container webhook in pod external-secrets-webhook-6b876f4974-mfxtm_external-secrets(4c359d8c-4dcf-471f-9518-8d9340194ba8)
4m13s (x148 over 28m) Warning Unhealthy Pod/external-secrets-webhook-6b876f4974-mfxtm Readiness probe failed: Get "http://10.42.2.8:8081/readyz": dial tcp 10.42.2.8:8081: connect: connection refused
$ kubectl logs -n external-secrets external-secrets-webhook-6b876f4974-mfxtm | tail -n 2
{"level":"error","ts":1715335816.9225318,"logger":"setup","msg":"invalid certs. retrying...","error":"stat /tmp/certs/tls.crt: no such file or directory","stacktrace":"github.com/external-secrets/external-secrets/cmd.waitForCerts\n\t/home/runner/work/external-secrets/external-secrets/cmd/webhook.go:215\ngithub.com/external-secrets/external-secrets/cmd.glob..func3\n\t/home/runner/work/external-secrets/external-secrets/cmd/webhook.go:84\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\ngithub.com/external-secrets/external-secrets/cmd.Execute\n\t/home/runner/work/external-secrets/external-secrets/cmd/root.go:255\nmain.main\n\t/home/runner/work/external-secrets/external-secrets/main.go:22\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.9/x64/src/runtime/proc.go:267"}
{"level":"error","ts":1715335826.9229984,"logger":"setup","msg":"unable to validate certificates","error":"context deadline exceeded","stacktrace":"github.com/external-secrets/external-secrets/cmd.glob..func3\n\t/home/runner/work/external-secrets/external-secrets/cmd/webhook.go:86\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\ngithub.com/external-secrets/external-secrets/cmd.Execute\n\t/home/runner/work/external-secrets/external-secrets/cmd/root.go:255\nmain.main\n\t/home/runner/work/external-secrets/external-secrets/main.go:22\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.9/x64/src/runtime/proc.go:267"}
$ kubectl events -A | grep -i UpdateFailed
default 9m50s (x19 over 31m) Warning UpdateFailed ValidatingWebhookConfiguration/externalsecret-validate ca cert not yet ready
default 9m50s (x19 over 31m) Warning UpdateFailed ValidatingWebhookConfiguration/secretstore-validate ca cert not yet ready
The chart is supposed to set up its own certificates via webhook.certManager, but that doesn't appear to be working.
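One way to confirm that the certificate was never written is to look at the Secret that backs /tmp/certs in the webhook pod. The sketch below assumes that Secret is named external-secrets-webhook (the usual name for a release called external-secrets; adjust it if the release name differs). The function name and the KUBECTL override are illustrative only.

```shell
#!/bin/sh
# KUBECTL can be overridden for dry runs; it defaults to the real kubectl.
KUBECTL=${KUBECTL:-kubectl}

# Succeed only if the webhook TLS Secret has a non-empty tls.crt entry.
# $1 = Secret name (ASSUMPTION: defaults to external-secrets-webhook)
webhook_cert_present() {
  crt=$($KUBECTL get secret "${1:-external-secrets-webhook}" \
    -n external-secrets -o 'jsonpath={.data.tls\.crt}') || return 1
  [ -n "$crt" ]
}

# Against a live cluster:
#   webhook_cert_present && echo "tls.crt populated" || echo "tls.crt missing"
```

If this reports the certificate as missing, the webhook's "stat /tmp/certs/tls.crt: no such file or directory" error is expected, and the problem lies upstream of the webhook itself.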
In addition to this (and perhaps because of it), the CRDs aren't being deployed properly:
$ kubectl apply -f secretstore.yaml
secret/webhook-credentials unchanged
error: resource mapping not found for name: "webhook-backend" namespace: "external-secrets" from "secretstore.yaml": no matches for kind "SecretStore" in version "external-secrets.io/v1alpha1"
ensure CRDs are installed first
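To see which CRDs are actually missing, the output of `kubectl get crd -o name` can be compared against the ones the manifests need. This is a minimal sketch: the function name is made up for illustration, the list of three CRDs is an assumption about what our manifests use, and it checks only that each CRD exists, not which API versions it serves.

```shell
#!/bin/sh
# Print each expected ESO CRD that is absent from the list read on stdin.
# Input: one resource per line, as printed by `kubectl get crd -o name`.
missing_eso_crds() {
  installed=$(cat)
  for crd in secretstores.external-secrets.io \
             clustersecretstores.external-secrets.io \
             externalsecrets.external-secrets.io; do
    # -x: whole-line match, -F: treat the dots literally
    printf '%s\n' "$installed" \
      | grep -qxF "customresourcedefinition.apiextensions.k8s.io/$crd" \
      || printf '%s\n' "$crd"
  done
}

# Against a live cluster:
#   kubectl get crd -o name | missing_eso_crds
```

Empty output means all three CRDs are registered; any name printed is one that still needs to be installed before the CRs can be applied.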
... The ESO documentation appears to imply that the CRD issue is caused by a known race condition in the upstream Helm chart when it is deployed with Flux, and it provides a workaround via a Kustomization. I don't think our implementation uses this method.
This race condition appears to be what's causing the certificate failure (and the subsequent context timeouts in our pipeline runs). Following the traces from the logs into the source code, external-secrets/cmd/webhook.go:waitForCerts uses the external-secrets CRD library to check the certificates on disk. The cert controller uses this same library, and its log contains some messages about the Reconciler. There is a comment on the Reconciler:
// the controller is ready when all crds are injected
// and the controller is elected as leader
... All of this is to say that the context timeout errors in our pipeline appear to be directly related to this known timing issue when using Flux, and implementing the upstream solution should resolve them.
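For manual testing outside of Flux, the same ordering (CRDs first, then the webhook, then the CRs) can be approximated with plain kubectl. This is only a sketch of the idea, not the upstream Kustomization workaround. The function name, the timeouts, and the KUBECTL override are assumptions, and the Deployment name is taken from the pod names above.

```shell
#!/bin/sh
# KUBECTL can be overridden for dry runs; it defaults to the real kubectl.
KUBECTL=${KUBECTL:-kubectl}

# Apply custom resources only once the CRD is established and the webhook is up.
# $1 = manifest containing the CRs (e.g. secretstore.yaml)
apply_crs_when_ready() {
  manifest=$1
  # 1. Block until the API server has registered the SecretStore CRD.
  $KUBECTL wait --for=condition=Established \
    crd/secretstores.external-secrets.io --timeout=120s || return 1
  # 2. Block until the webhook Deployment reports a complete rollout.
  $KUBECTL rollout status deployment/external-secrets-webhook \
    -n external-secrets --timeout=300s || return 1
  # 3. Only now create the CRs that the webhook validates.
  $KUBECTL apply -f "$manifest"
}

# Against a live cluster:
#   apply_crs_when_ready secretstore.yaml
```

If step 1 or step 2 times out, the function stops before applying anything, which makes it easier to tell a CRD problem apart from a certificate problem.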