NeuVector upgrade controller crashes with mTLS STRICT
Bug
Description
During an upgrade of Big Bang I see NeuVector controller pods crash with the below log errors:
Initial error:
```
2023-10-27T16:11:02.441Z [ERROR] agent.server: Member has a conflicting expect value. All nodes should expect the same number.: member="{10.42.0.12 10.42.0.12 18301 map[acls:0 build:1.11.11:8a6d4151 dc:neuvector expect:3 ft_fs:1 ft_si:1 id:8d972b2a-f19a-b69f-b90f-b50f4efe89a4 port:18300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2] alive 1 5 2 2 5 4}"
```
This error loops a few times and seems relatively normal during an upgrade. Then the logs below repeat a number of times and the pod crashes:
```
2023-10-27T16:13:16.535|INFO|CTL|cluster.(*consulMethod).Start: Consul start - args=[agent -datacenter neuvector -data-dir /tmp/neuvector -server -bootstrap-expect 4 -config-file /tmp/neuvector/consul.json -bind 10.42.1.10 -advertise 10.42.1.10 -node 10.42.1.10 -node-id 6d0e080f-6c83-50d8-d97f-81d3e34821b6 -raft-protocol 3 -retry-join 10.42.1.9 -retry-join 10.42.0.12 -retry-join 10.42.2.8]
==> Starting Consul agent...
Version: '1.11.11'
Node ID: '6d0e080f-6c83-50d8-d97f-81d3e34821b6'
Node name: '10.42.1.10'
Datacenter: 'neuvector' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: -1)
Cluster Addr: 10.42.1.10 (LAN: 18301, WAN: -1)
Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2023-10-27T16:13:16.572Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start RPC layer: listen tcp 10.42.1.10:18300: bind: address already in use"
2023-10-27T16:13:16.574|ERRO|CTL|cluster.(*consulMethod).Start: Consul process exit - error=exit status 1
2023-10-27T16:13:16.574|ERRO|CTL|cluster.StartCluster: Failed to start cluster - error=exit status 1
2023-10-27T16:13:16.574|ERRO|CTL|main.clusterStart: Cluster failed - error=exit status 1 waited=5m0s
2023-10-27T16:13:16|MON|Process ctrl exit status 255, pid=642
Leave the cluster
2023-10-27T16:13:20.036Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2023-10-27T16:13:20.036Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Graceful leave complete
2023-10-27T16:13:25|MON|Clean up.
```
Eventually the pods become healthy and everything reconciles, but I've seen it take upwards of 30 minutes and 6 crashes before a single pod comes healthy. Interestingly, I have never seen this failure when I set NeuVector's Istio mTLS mode to PERMISSIVE (or delete the PeerAuthentications; see the sketch below). Based on the logs, it seems like NeuVector's Consul traffic is being blocked or otherwise affected by Istio's sidecar.
This issue only presents itself across upgrades; I haven't seen this failure on clean installs.
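For reference, the PERMISSIVE workaround amounts to relaxing the mesh policy applied to the NeuVector namespace. A minimal sketch of the equivalent resource, assuming NeuVector runs in a `neuvector` namespace (the resource name and namespace here are illustrative, not necessarily what the chart creates):

```yaml
# Sketch only: namespace-wide PeerAuthentication relaxed to PERMISSIVE.
# Name and namespace are assumptions for illustration.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: neuvector-permissive
  namespace: neuvector
spec:
  mtls:
    mode: PERMISSIVE  # STRICT is the mode that reproduces the crash loops
```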
Attempted fixes
I attempted a number of ways to fix/handle this:
- Adjusting the upgrade strategy to prevent too many Consul members from churning at once (see the first sketch after this list) - seemed slightly better, but failures were still inconsistent; probably a placebo improvement
- Excluding certain ports from the Istio sidecar (see the second sketch after this list) - tried a mix of combinations, including all of the Consul ports, with no effect
- Setting mTLS to PERMISSIVE - the only thing that consistently worked
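To make the first two attempts concrete, here are rough sketches rather than the exact manifests I applied; the keys and port lists are illustrative. The first assumes the controller is a standard Deployment whose rollout strategy can be overridden, so that pods are replaced one at a time instead of surging extra Consul members into the cluster:

```yaml
# Hypothetical rollout tuning on the controller Deployment (spec fragment):
# replace pods one at a time instead of adding new Consul members alongside old ones.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
```

The second used the standard Istio sidecar annotations on the controller pod template to bypass the proxy for the Consul ports that appear in the logs above (18300 RPC, 18301 LAN/serf). The exact port combinations I tried varied; treat this as illustrative:

```yaml
# Hypothetical pod template annotations to exclude Consul traffic from the sidecar.
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "18300,18301"
    traffic.sidecar.istio.io/excludeOutboundPorts: "18300,18301"
```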
BigBang / NeuVector Version
This is across an upgrade from Big Bang 2.12.0 -> 2.13.1. The NeuVector package version changes from 2.4.5-bb.6 to 2.6.3-bb.0.