Create a PrometheusRule resource that fires an alert whenever defender pods are missing. In other words, the alert should fire whenever the number of nodes exceeds the number of defender pods.
I still see this as worthwhile. I might just need to think on it more, but is this feasible as a Gatekeeper/Kyverno policy (is it even something we can "monitor" with those)? Or alternatively, would it be more valuable to approach this with a Grafana dashboard (or something similar), separate from our policy agent?
Good find - I think we'll probably just migrate this issue to a Kyverno issue. As far as I'm tracking, we don't intend to continue improvements on Gatekeeper at this point; all enhancements should be for Kyverno only.
That external data source is interesting - I haven't seen it used before, but it seems promising. This should be workable in Kyverno land, with some edits to the issue.
@runyontr @michaelmcleroy This looks like a good candidate for a scope update. The functionality may be possible, but it will likely require a spike to fully understand the implications and to decide whether to proceed with Kyverno or use a different solution in the event that Kyverno cannot accomplish the intended behavior.
@brandt.w.keller I think it would be tough for a policy enforcer like Kyverno to do this. Since it ties into the admission controller, it can monitor resources as they enter and exit the cluster, but it typically only evaluates one resource at a time. For this, it would have to track whether each node has a defender, and I don't think it's designed to aggregate across resources to decide whether the check passes or fails.
The best we could do is maybe create a policy to validate that the defender daemonset isn't kept off of any nodes by taints.
I think a better approach might be to enhance cluster auditor. We have always had plans for it to evaluate the cluster and check for this type of thing (e.g. no logging stack, default passwords, etc.). This would be a good item for cluster auditor to check and report to Grafana. For example, it could do a kubectl get pods and a kubectl get nodes to verify a defender is running on each node.
Agreed - the more I thought about how it would fit in with the admission controller, the less confident I was in that placement.
Cluster auditor sounds like a good place to execute some of these checks. I need to do some reading, but at a minimum it sounds like modifying this issue is still valid.
Switching this issue to be accomplished via cluster auditor would require a larger effort (possibly larger than just a single issue), unless I'm overthinking or misrepresenting things here.
Cluster auditor is currently an upstream project we are utilizing; the application is essentially just a Prometheus exporter for Gatekeeper constraint/violation information. My suspicion is that we're probably going to be getting rid of cluster auditor (at least as it refers to the upstream app) along with our removal of Gatekeeper.
So to make a new cluster auditor with this functionality, we would need to work through something like this:
1. Writing "the code" to handle checking for defenders and exporting the results in a Prometheus-consumable format (and ideally making this code "good"/extensible for other types of auditing)
2. Building this into an image to run through IB, and getting said image approved
3. Building this into a new package in BB (adhering to all the package standards)
Definitely think that we want to get to this point (providing "custom" auditing solutions for end users), but it might need to be part of a separate epic for cluster auditor refactors/enhancements.
I talked a little with @runyontr about this. He said we could accomplish this with the data already being scraped; we just need to make Alertmanager flag when the number of nodes > the number of pods in the Twistlock daemonset. So the scope would be the Monitoring stack.
I think that makes sense; this would just become a new PrometheusRule resource contained in the Twistlock repo/deployment. Definitely feasible in that context.
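As a rough sketch (not a tested rule), it could look something like the following. It assumes kube-state-metrics is deployed by the monitoring stack (for the kube_node_info and kube_daemonset_status_number_ready metrics), and the daemonset name, namespace, and rule-discovery label are placeholders that would need to match the actual Twistlock deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: twistlock-defender-coverage
  namespace: twistlock          # placeholder namespace
  labels:
    app: twistlock              # placeholder: whatever label Prometheus uses to discover rules
spec:
  groups:
    - name: twistlock-defender.rules
      rules:
        - alert: TwistlockDefenderMissing
          # Fires when the node count exceeds the number of ready defender pods.
          # Metric names come from kube-state-metrics; the daemonset selector is a placeholder.
          expr: |
            count(kube_node_info)
              >
            sum(kube_daemonset_status_number_ready{namespace="twistlock", daemonset="twistlock-defender-ds"})
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Fewer Twistlock defender pods than nodes"
            description: "The number of nodes exceeds the number of ready defender pods, so one or more nodes may be missing a defender."
```

The `for: 15m` is just to keep the alert from flapping while new nodes come up and the daemonset catches up; that window (and the severity) would be a judgment call for the package.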