NeuVector upgrade controller crashes with mTLS STRICT
Bug
Description
During an upgrade of Big Bang I see NeuVector controller pods crash with the below log errors:
Initial error:
```
2023-10-27T16:11:02.441Z [ERROR] agent.server: Member has a conflicting expect value. All nodes should expect the same number.: member="{10.42.0.12 10.42.0.12 18301 map[acls:0 build:1.11.11:8a6d4151 dc:neuvector expect:3 ft_fs:1 ft_si:1 id:8d972b2a-f19a-b69f-b90f-b50f4efe89a4 port:18300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2] alive 1 5 2 2 5 4}"
```
This error loops a few times and seems relatively normal during an upgrade. Then the logs below repeat a number of times and the pod crashes:
```
2023-10-27T16:13:16.535|INFO|CTL|cluster.(*consulMethod).Start: Consul start - args=[agent -datacenter neuvector -data-dir /tmp/neuvector -server -bootstrap-expect 4 -config-file /tmp/neuvector/consul.json -bind 10.42.1.10 -advertise 10.42.1.10 -node 10.42.1.10 -node-id 6d0e080f-6c83-50d8-d97f-81d3e34821b6 -raft-protocol 3 -retry-join 10.42.1.9 -retry-join 10.42.0.12 -retry-join 10.42.2.8]
==> Starting Consul agent...
Version: '1.11.11'
Node ID: '6d0e080f-6c83-50d8-d97f-81d3e34821b6'
Node name: '10.42.1.10'
Datacenter: 'neuvector' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: -1)
Cluster Addr: 10.42.1.10 (LAN: 18301, WAN: -1)
Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2023-10-27T16:13:16.572Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start RPC layer: listen tcp 10.42.1.10:18300: bind: address already in use"
2023-10-27T16:13:16.574|ERRO|CTL|cluster.(*consulMethod).Start: Consul process exit - error=exit status 1
2023-10-27T16:13:16.574|ERRO|CTL|cluster.StartCluster: Failed to start cluster - error=exit status 1
2023-10-27T16:13:16.574|ERRO|CTL|main.clusterStart: Cluster failed - error=exit status 1 waited=5m0s
2023-10-27T16:13:16|MON|Process ctrl exit status 255, pid=642
Leave the cluster
2023-10-27T16:13:20.036Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2023-10-27T16:13:20.036Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Graceful leave complete
2023-10-27T16:13:25|MON|Clean up.
```
Eventually the pods become healthy and everything reconciles, but I've seen it take upwards of 30 minutes and 6 crashes before a single pod comes healthy. Interestingly, I have never seen this failure when I set NeuVector's Istio mTLS mode to PERMISSIVE (or delete the PeerAuthentications; see the sketch below). Based on the logs, it seems like NeuVector's Consul traffic is being blocked or otherwise affected by Istio's sidecar.
This issue only presents itself across upgrades; I haven't seen this failure on clean installs.
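For reference, the PERMISSIVE workaround amounts to relaxing the mesh policy applied to the NeuVector namespace. A minimal sketch of the equivalent resource, assuming NeuVector runs in a `neuvector` namespace (the resource name and namespace here are illustrative, not necessarily what the chart creates):

```yaml
# Sketch only: namespace-wide PeerAuthentication relaxed to PERMISSIVE.
# Name and namespace are assumptions for illustration.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: neuvector-permissive
  namespace: neuvector
spec:
  mtls:
    mode: PERMISSIVE  # STRICT is the mode that reproduces the crash loops
```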
Attempted fixes
I attempted a number of ways to fix/handle this:
- Adjusting the upgrade strategy to prevent too many Consul members from churning at once (see the first sketch after this list) - seemed slightly better, but failures were still inconsistent; probably a placebo improvement
- Excluding certain ports from the Istio sidecar (see the second sketch after this list) - tried a mix of combinations, including all of the Consul ports, with no effect
- Setting mTLS to PERMISSIVE - the only thing that consistently worked
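To make the first two attempts concrete, here are rough sketches rather than the exact manifests I applied; the keys and port lists are illustrative. The first assumes the controller is a standard Deployment whose rollout strategy can be overridden, so that pods are replaced one at a time instead of surging extra Consul members into the cluster:

```yaml
# Hypothetical rollout tuning on the controller Deployment (spec fragment):
# replace pods one at a time instead of adding new Consul members alongside old ones.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
```

The second used the standard Istio sidecar annotations on the controller pod template to bypass the proxy for the Consul ports that appear in the logs above (18300 RPC, 18301 LAN/serf). The exact port combinations I tried varied; treat this as illustrative:

```yaml
# Hypothetical pod template annotations to exclude Consul traffic from the sidecar.
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "18300,18301"
    traffic.sidecar.istio.io/excludeOutboundPorts: "18300,18301"
```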
BigBang / NeuVector Version
This is across an upgrade from Big Bang 2.12.0 -> 2.13.1. The NeuVector package version changes from 2.4.5-bb.6 to 2.6.3-bb.0.