UNCLASSIFIED - NO CUI

Resolve deployment of ESO webhook certificates, CRDs and CRs

TL;DR - Have the bigbang external-secrets package apply the upstream Flux workaround for the CRD race condition (described below), and the pipelines should pass

When deploying external-secrets, something is preventing the deployment of certificates and/or CRDs.

$ kubectl get pods -n external-secrets -o wide
NAME                                                READY   STATUS             RESTARTS       AGE   IP          NODE                      NOMINATED NODE   READINESS GATES
external-secrets-cert-controller-545f79bb6f-8qfgb   0/1     Running            0              28m   10.42.2.6   k3d-k3s-default-agent-2   <none>           <none>
external-secrets-webhook-6b876f4974-mfxtm           0/1     CrashLoopBackOff   7 (104s ago)   28m   10.42.2.8   k3d-k3s-default-agent-2   <none>           <none>
external-secrets-7888b6b965-w759b                   0/1     CrashLoopBackOff   7 (91s ago)    28m   10.42.2.7   k3d-k3s-default-agent-2   <none>           <none>

$ kubectl events -n external-secrets | grep external-secrets-webhook
29m                     Normal    ScalingReplicaSet   Deployment/external-secrets-webhook                      Scaled up replica set external-secrets-webhook-6b876f4974 to 1
29m                     Normal    SuccessfulCreate    ReplicaSet/external-secrets-webhook-6b876f4974           Created pod: external-secrets-webhook-6b876f4974-mfxtm
29m                     Normal    Scheduled           Pod/external-secrets-webhook-6b876f4974-mfxtm            Successfully assigned external-secrets/external-secrets-webhook-6b876f4974-mfxtm to k3d-k3s-default-agent-2
29m                     Normal    Pulling             Pod/external-secrets-webhook-6b876f4974-mfxtm            Pulling image "registry1.dso.mil/ironbank/opensource/external-secrets/external-secrets:v0.9.15"
29m                     Normal    Pulled              Pod/external-secrets-webhook-6b876f4974-mfxtm            Successfully pulled image "registry1.dso.mil/ironbank/opensource/external-secrets/external-secrets:v0.9.15" in 1.785s (1.785s including waiting)
29m                     Normal    Created             Pod/external-secrets-webhook-6b876f4974-mfxtm            Created container webhook
29m                     Normal    Started             Pod/external-secrets-webhook-6b876f4974-mfxtm            Started container webhook
9m14s (x29 over 25m)    Warning   BackOff             Pod/external-secrets-webhook-6b876f4974-mfxtm            Back-off restarting failed container webhook in pod external-secrets-webhook-6b876f4974-mfxtm_external-secrets(4c359d8c-4dcf-471f-9518-8d9340194ba8)
4m13s (x148 over 28m)   Warning   Unhealthy           Pod/external-secrets-webhook-6b876f4974-mfxtm            Readiness probe failed: Get "http://10.42.2.8:8081/readyz": dial tcp 10.42.2.8:8081: connect: connection refused

$ kubectl logs -n external-secrets external-secrets-webhook-6b876f4974-mfxtm | tail -n 2
{"level":"error","ts":1715335816.9225318,"logger":"setup","msg":"invalid certs. retrying...","error":"stat /tmp/certs/tls.crt: no such file or directory","stacktrace":"github.com/external-secrets/external-secrets/cmd.waitForCerts\n\t/home/runner/work/external-secrets/external-secrets/cmd/webhook.go:215\ngithub.com/external-secrets/external-secrets/cmd.glob..func3\n\t/home/runner/work/external-secrets/external-secrets/cmd/webhook.go:84\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\ngithub.com/external-secrets/external-secrets/cmd.Execute\n\t/home/runner/work/external-secrets/external-secrets/cmd/root.go:255\nmain.main\n\t/home/runner/work/external-secrets/external-secrets/main.go:22\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.9/x64/src/runtime/proc.go:267"}
{"level":"error","ts":1715335826.9229984,"logger":"setup","msg":"unable to validate certificates","error":"context deadline exceeded","stacktrace":"github.com/external-secrets/external-secrets/cmd.glob..func3\n\t/home/runner/work/external-secrets/external-secrets/cmd/webhook.go:86\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\ngithub.com/external-secrets/external-secrets/cmd.Execute\n\t/home/runner/work/external-secrets/external-secrets/cmd/root.go:255\nmain.main\n\t/home/runner/work/external-secrets/external-secrets/main.go:22\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.9/x64/src/runtime/proc.go:267"}

$ kubectl events -A | grep -i UpdateFailed
default            9m50s (x19 over 31m)    Warning   UpdateFailed                     ValidatingWebhookConfiguration/externalsecret-validate   ca cert not yet ready
default            9m50s (x19 over 31m)    Warning   UpdateFailed                     ValidatingWebhookConfiguration/secretstore-validate      ca cert not yet ready

The chart is supposed to set up its own certificates via webhook.certManager, but that doesn't appear to be working.

In addition to this (or perhaps because of it), the CRDs aren't being deployed properly.

$ kubectl apply -f secretstore.yaml 
secret/webhook-credentials unchanged
error: resource mapping not found for name: "webhook-backend" namespace: "external-secrets" from "secretstore.yaml": no matches for kind "SecretStore" in version "external-secrets.io/v1alpha1"
ensure CRDs are installed first
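
To tell a missing CRD apart from a CRD that exists but does not serve `v1alpha1`, it can help to ask the API server directly. This is a minimal sketch, not something from the chart: `served_versions` is a hypothetical helper name, and the default CRD name `secretstores.external-secrets.io` is an assumption based on the usual `<plural>.<group>` naming.

```shell
# Hypothetical helper: list each version of a CRD and whether it is served.
# The default CRD name is an assumption (<plural>.<group> convention).
served_versions() {
  crd="${1:-secretstores.external-secrets.io}"
  # If the CRD object is absent, say so and stop.
  if ! kubectl get crd "$crd" >/dev/null 2>&1; then
    echo "CRD $crd is not installed"
    return 1
  fi
  # One line per version: <name><TAB><served>
  kubectl get crd "$crd" \
    -o jsonpath='{range .spec.versions[*]}{.name}{"\t"}{.served}{"\n"}{end}'
}

# Example: served_versions
```

If the helper reports the CRD as not installed, the failure above is an ordering problem rather than an API version mismatch in secretstore.yaml.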

The ESO documentation appears to imply that the CRD issue is due to a known race condition in the upstream Helm chart when deploying with Flux, and it provides a workaround via a Kustomization. I don't think our implementation uses this method.
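
The ordering that such a workaround enforces (CRDs first, custom resources afterwards) can be approximated by hand while debugging. This is a rough sketch only, not the Flux configuration itself: `apply_after_crds` is a hypothetical helper, the 120s timeout is arbitrary, and the CRD name in the example is assumed from the `<plural>.<group>` convention.

```shell
# Hypothetical helper: wait for the named CRDs to become Established,
# then apply a manifest that depends on them.
apply_after_crds() {
  manifest="$1"
  shift
  for crd in "$@"; do
    # Stop at the first CRD that is missing or never becomes Established.
    kubectl wait --for=condition=Established "crd/$crd" --timeout=120s || return 1
  done
  kubectl apply -f "$manifest"
}

# Example: apply_after_crds secretstore.yaml secretstores.external-secrets.io
```

Note that on older kubectl versions `kubectl wait` fails immediately when the resource does not exist yet, so this only helps once the CRD objects have at least been created.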

This race condition also appears to be what causes the certificate failure (and the subsequent context timeouts in our pipeline runs). Following the stack traces from the logs into the source code, waitForCerts in external-secrets/cmd/webhook.go uses the external-secrets CRD library to check the certificates on disk. The cert controller uses the same library, and its log contains some messages about the Reconciler type. There is a comment on the Reconciler:

	// the controller is ready when all crds are injected
	// and the controller is elected as leader

All of this is to say that the context timeout errors in our pipeline appear to be directly related to this known timing issue when using Flux, and implementing the upstream workaround should resolve them.

Edited by Andrew Kesterson