Review https://istio.io/latest/docs/setup/upgrade/canary/ and identify if Big Bang can either document or provide a method to upgrade Istio using a canary deployment. Based on the investigation, create additional issues to implement a solution.
While testing the introduction of a canary field to istiooperator, I did not get the expected outcome. I create an istiooperator as usual:
```yaml
istiooperator:
  enabled: true
  ...
```
Flux creates the gitrepo, hr, etc. The operator runs as expected and the hr/operator surfaces a "Ready=True" status.
I then simulate a canary upgrade by updating istiooperator to include a canary:
```yaml
istiooperator:
  ...
  canary:
    revision: "1-10-3"
    ...
```
Flux creates the gitrepo, hr, etc. for the "1-10-3" operator. Both operators run as expected and both hr's surface a "Ready=True" status.
The final step in the canary upgrade is to remove the initial operator. This is accomplished by removing the canary field and specifying the same revision, repo, tag, etc. in istiooperator as was specified in canary:
```yaml
istiooperator:
  revision: "1-10-3"
  ...
```
This causes Flux to update the initial operator and remove the canary operator instead of removing the initial operator and using the new "1-10-3" operator. For additional details see the following:
As we discussed during our 8/18 meeting, I think a map[string]{} is an approach that will provide the desired outcome. A user will create istiooperators as follows:
```yaml
istiooperators:
  default: # This name is appended to Istio Operator resources.
    enabled: true
    ...
```
A second operator is created with a revision to perform the canary upgrade:
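A minimal sketch of what that second entry might look like, assuming the same `enabled`/`revision` fields used elsewhere in this thread; the `canary` key name and `1-10-3` revision mirror the test described below and are illustrative, not prescriptive:

```yaml
istiooperators:
  default: # existing operator
    enabled: true
    # remaining fields (git repo, tag, etc.) omitted
  canary: # second operator used for the canary upgrade
    enabled: true
    revision: "1-10-3"
    # remaining fields (git repo, tag, etc.) omitted
```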
I have successfully tested the map[string]{} approach. I create a named instance of istiooperators:
```yaml
istiooperators:
  default:
    enabled: true
    git:
      ...
```
Note: The "default" istiooperator is disabled by default. This allows the definitions for the "default" operator to be completely removed from the customer template post canary upgrade.
Flux creates the gitrepo, hr, etc. The operator runs as expected and the hr/operator surfaces a "Ready=True" status.
I then simulate a canary upgrade by adding another operator to istiooperators:
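A hedged sketch of the values after adding the second operator; the `canary` key and `1-10-3` revision come from the description in the next paragraph, while the exact `git` layout is an assumption:

```yaml
istiooperators:
  default:
    enabled: true
    git:
      # repo/tag fields omitted
  canary:
    enabled: true
    revision: "1-10-3" # appended to the canary operator's resource names
    git:
      # repo/tag fields omitted
```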
Flux creates the gitrepo, hr, etc. for the "canary" operator, appending "1-10-3" to operator resource names. If revision is undefined for the "canary" operator, the operator name is appended instead. Both operators run as expected and both hr's surface a "Ready=True" status.
The final step in the canary upgrade is to remove the "default" operator:
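A hedged sketch of the final values, per the note above that the "default" definitions can be removed entirely from the customer template; field names are assumptions:

```yaml
istiooperators:
  # "default" entry deleted (it is disabled by default, so its definitions
  # can simply be removed from the customer template post canary upgrade)
  canary:
    enabled: true
    revision: "1-10-3"
    git:
      # repo/tag fields omitted
```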
We shouldn't have canary then be the default/deployed instance of Istio; that'll be confusing to people and won't let them get the upgrades we plan for.
What's the issue with having the default object roll to the proposed (canary) version when that is accepted by the end user?
```yaml
istio:
  git:
    tag: 1.10.3 # sets to revision in the `istio` HR. Manually set to current version before the upgrade
  canary:
    enabled: true
    git:
      tag: 1.10.4 # ports to the revision `istio-canary`
```
At this point the admins can verify functionality and when they're ready:
```yaml
istio:
  canary:
    enabled: false # would be default, but shown here to be explicit
```
This would then roll the istio HR to the new/proposed revision?
We shouldn't have canary then be the default/deployed instance of Istio; that'll be confusing to people and won't let them get the upgrades we plan for.
@runyontr default is the default Istio Operator for the map[string]{} approach, not canary. canary is simply a unique key in the istiooperators map that represents an Istio Operator.
This would then roll the istio HR to the new/proposed revision?
@runyontr the canary approach does not work as intended during the last step of the canary upgrade (removing the initial Istio Operator). Let's walk through an upgrade using this approach.
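For concreteness, a hedged sketch of the values at the start of this walkthrough, modeled on the canary-field example above; the tag values are hypothetical:

```yaml
istiooperator:
  git:
    tag: 1.9.7-bb.0 # hypothetical tag for the currently installed 1.9.7 operator
  canary:
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0 # hypothetical tag for the new 1.10.3 operator
```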
Flux creates the 1.10.3 Istio Operator repo, hr, etc. and all works as usual. BB is now running two Istio Operators, the initial operator, i.e. 1.9.7, and the new 1.10.3 operator. Note: The existing Istio Operator resources are unmodified by Flux.
The user upgrades the Istio data-plane, i.e. sets revision label, bounces workloads, etc. and now it's time to uninstall the initial Istio Operator. The user updates the template to transition the canary values to istiooperator and remove the canary, for example:
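A hedged sketch of that template update; tag values are hypothetical:

```yaml
istiooperator:
  revision: "1-10-3"
  git:
    tag: 1.10.3-bb.0 # hypothetical tag; canary values promoted, canary block removed
```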
This causes Flux to remove the canary Istio Operator and update the initial Istio Operator to 1.10.3, which is not the desired behavior. We want to keep the canary Istio Operator running and uninstall the initial Istio Operator. A potential workaround to this issue is to disable the initial Istio Operator and keep the canary Istio Operator running, for example:
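A hedged sketch of that workaround, assuming `canary` gains its own `enabled` flag as discussed below; tag values are hypothetical:

```yaml
istiooperator:
  enabled: false # initial operator disabled rather than removed
  canary:
    enabled: true
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0 # hypothetical tag
```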
Even if canary supports enabled: true/false, this results in a degraded user experience.
How are subsequent upgrades handled? canary is now the only Istio Operator, so is the primary istiooperator then used to perform the next canary upgrade? Again, this is a degraded user experience.
This causes Flux to update the initial operator and remove the canary operator instead of removing the initial operator and using the new "1-10-3" operator.
@runyontr IMO it's a problem for the following reasons:
It's not a true canary upgrade. A canary creates a new instance of X, let's say X2, while X continues to serve production traffic. Traffic is incrementally shifted over to X2. After all traffic is shifted to X2, X is removed.
It would differ from the process documented upstream.
@runyontr thank you for the additional insight into the approach you're considering. The approach you propose has an issue during the remove-canary step. While both old and new versions are running, the canary should not be removed since it is now managing the data-plane proxies. A potential solution is to update the old operator with the new revision, image ref, etc. to match the canary and then remove the canary. This would mean the cluster has 2 active control-planes running the same version/revision until the canary is removed. Although I have not tested this, I anticipate issues and expect this is not supported upstream.
Note that enabled: true is shown here for clarity and is unneeded since these pkgs are enabled by default.
A user upgrades BB to a version that supports Istio canary upgrades. The istio/istiooperator fields exist for backwards compatibility. Assuming the new BB version bumps the default Istio image tags and the user is fine with an in-place Istio upgrade, the istio/istiooperator fields from step 1 are left unchanged. If the user prefers to canary upgrade, then a tag should be used to specify the previous pkg versions used in step 1 before bumping BB. For example:
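A hedged sketch of pinning the previous pkg versions before bumping BB; the tag values are hypothetical:

```yaml
istio:
  git:
    tag: 1.9.7-bb.3 # hypothetical tag pinning the pre-upgrade Istio pkg
istiooperator:
  git:
    tag: 1.9.7-bb.1 # hypothetical tag pinning the pre-upgrade operator pkg
```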
Now that BB has been upgraded and the git tag has been added to ensure Istio stays at the previous release, another istio/istiooperator can be created using the new istios and istiooperators fields that specify the new Istio release supported by BB:
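A hedged sketch of the new entries; the `istios`/`istiooperators` field names and the `1-10-3` key come from this proposal, while `revision` and the exact `git` layout are assumptions:

```yaml
istios:
  1-10-3:
    enabled: true
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0
istiooperators:
  1-10-3:
    enabled: true
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0
```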
Note that steps 2 and 3 can be combined so the 1-10-3 instances of istio and istiooperator are created during the BB upgrade. The BB cluster now has 2 Istio control-planes. All proxies except gateways are still using the 1.9.7 Istio control-plane.
The user upgrades the Istio data-plane by rolling the existing workload deployments, i.e. kubectl rollout restart deployment -n $NS_NAME. After all proxies are using the 1.10.3 Istio control-plane, the 1.9.7 control-plane and operator can be removed, for example:
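A hedged sketch of removing the 1.9.7 pieces, consistent with the `enabled: false` expectation quoted later in this thread; the 1-10-3 entries are left as-is:

```yaml
istio:
  enabled: false # removes the 1.9.7 control-plane
istiooperator:
  enabled: false # removes the 1.9.7 operator
istios:
  1-10-3:
    enabled: true
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0
istiooperators:
  1-10-3:
    enabled: true
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0
```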
Fast forward to the future and a new release of BB is available. The user will upgrade BB but istio/istiooperator will not get bumped since tag: 1.10.3-bb.0 is specified for both. After the user ensures the rest of BB is working as expected using the new release, it's time to start the Istio canary upgrade. The user creates new instances of both that reference the Istio version supported by the newly installed BB release, for example:
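A hedged sketch of the new instances; the `1-11-0` key and tag are hypothetical values for the newly supported release:

```yaml
istios:
  1-11-0:
    enabled: true
    revision: "1-11-0"
    git:
      tag: 1.11.0-bb.0 # hypothetical tag for the new BB-supported Istio release
istiooperators:
  1-11-0:
    enabled: true
    revision: "1-11-0"
    git:
      tag: 1.11.0-bb.0
```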
This causes a new 1.11.0 Istio operator and control-plane to be created. The BB cluster now has 2 Istio operators and control-planes. At this point all proxies except gateways are still using the 1.10.3 control-plane. After rolling the workload deployments and verifying e2e functionality, it's time to remove the 1.10.3 operator and control-plane. This is accomplished by removing the 1-10-3 instances of the operator and control-plane:
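A hedged sketch of the resulting values, with only the 1-11-0 instances remaining; field names follow the earlier sketches and are assumptions:

```yaml
istios:
  # 1-10-3 entry removed
  1-11-0:
    enabled: true
    revision: "1-11-0"
    git:
      tag: 1.11.0-bb.0
istiooperators:
  # 1-10-3 entry removed
  1-11-0:
    enabled: true
    revision: "1-11-0"
    git:
      tag: 1.11.0-bb.0
```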
From step 4 and beyond, is the expectation that the values are:
```yaml
istio:
  enabled: false
istiooperator:
  enabled: false
```
This will cause some complexity issues as we try to determine whether istio is enabled in the cluster or not, in order to let the other packages know.
Upgrades:
After step 4 occurs, it's not clear how an end user would attempt an Istio upgrade without first reading the release notes and manually copying the new values section for the upgraded Istio version. I think this creates opportunities for people to either:
From step 4 and beyond, is the expectation that the values are:
```yaml
istio:
  enabled: false
istiooperator:
  enabled: false
```
@runyontr yes, this is the expectation while istio and istiooperator are being deprecated. As part of the deprecation plan, we could set enabled: false by default for istio and istiooperator to address this issue. What is the BB policy for updating default values?
This will cause some complexity issues as we try to determine whether istio is enabled in the cluster or not, in order to let the other packages know.
Yes, additional code complexity is required to support backwards compatibility, but that should be expected. I agree that we must reduce the user-facing complexity as much as possible. The backwards compatibility code and istio/istiooperator fields should be removed when deprecation is complete.
After step 4 occurs, it's not clear how an end user would attempt an Istio upgrade without first reading the release notes and manually copying the new values section for the upgraded Istio version.
An upgrade guide is required for this issue to be considered feature complete. Performing Istio canary upgrades will be more complex for users due to the API changes, i.e. istio > istios. Since the new API types are similar to the existing types, my hope is that users will quickly adjust their mental model and future upgrades will be less complex post deprecation.
@runyontr @michaelmcleroy I successfully completed a Flux-driven Istio 1.9.7 > 1.10.3 canary upgrade using the map[string]{} approach. I did uncover an issue that requires discussion. During the Istio control-plane upgrade, the Istio ingressgateway is upgraded in-place, which is expected and consistent with the upstream process. This means attached Gateway resources are now using the new version, which again is consistent with the upstream process. However, from a Flux perspective a Gateway is associated with a particular HelmRelease. If two HelmReleases specify the same Gateway, the HelmRelease that currently manages the Gateway deletes it, and the Gateway is then created by the other HelmRelease. This Gateway "recreate" causes services exposed by the Gateway to be unavailable during the delete/create, xDS sync, etc. In my testing this outage is typically ~30 seconds, but still not ideal.
From my understanding, there is no way to share intent across HelmReleases. One potential approach to work around this limitation is to abstract gateway from the Istio HelmRelease into its own HelmRelease. Let me know your thoughts when you have a moment. In the meantime, I'll start working through the backwards compatibility details.
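A rough, purely illustrative sketch of that separation, assuming a standalone istio-gateway pkg surfaced under a hypothetical `istiogateway` values key; none of these field names are confirmed in this thread:

```yaml
# Gateways defined once in their own pkg/HelmRelease so ownership does not
# flip between the two Istio control-plane HelmReleases during a canary upgrade.
istiogateway:
  enabled: true
  gateways:
    public: # hypothetical gateway definition previously owned by the istio HR
      # selector, servers, etc. omitted
istios:
  1-10-3:
    enabled: true
    # gateway definitions removed from the control-plane instances
```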
@michaelmcleroy here is my repo for the istio-gateway pkg needed to fix the above issue. I could use assistance in forking the repo to Repo1. Note: The istio-controlplane pkg will continue to include gateway management functionality for backwards compatibility support, and it should be removed post deprecation.
Reviewed the refactored upgrade guide from the GitHub gist and confirmed it worked fine through the update and rollback once the NetworkPolicies for the new revision appeared in the namespace. I initially ran into an issue with that, so pods and connections were failing until I disabled/re-enabled the revision.
After discussion with the team, the canary work was not approved for implementation into Big Bang due to the following risk areas:
Added complexity in values to configure multiple operators, gateways, and control planes would cause confusion for users not using canary upgrades
Canary upgrade has been requested by a single customer, who was lukewarm on the solution provided
If something goes wrong in the canary upgrade or rollback, troubleshooting and fixing the problems can result in extended downtime
Small risk that the old or new version of Istio's resources (e.g. HelmReleases, GitRepositories, VirtualServices, ConfigMaps, etc.) are not compatible with the single version of Flux or the new/old version of the control plane
Incompatibilities between the Envoy proxies of two Istio versions can cause loss of communication between applications using different control planes (e.g. Jaeger needs to communicate with the Logging stack to become ready if both are enabled).
We also explored manually doing a Canary upgrade, outside of Big Bang. This can be done, but it is not recommended for the following reasons:
To avoid downtime, the ownership of resources created manually would need to be "transferred" to Big Bang using a series of patches for annotations/labels. This process is risky and has no guardrails to protect from mistakes.
If something goes wrong in the canary upgrade or rollback, troubleshooting and fixing the problems can result in extended downtime
You need to copy Big Bang's Helm template and reduce it to deploy just Istio. There are no guardrails for this process to avoid problems.
Our final recommendation is for users to test the new version of Istio in a production-like staging environment to understand the implications of the upgrade and potential downtime before planning the production upgrade. An in-place upgrade is the recommended approach.
Complex workflows such as canary upgrades/rollbacks can be greatly simplified through an operator pattern. Many platforms have been successful following this approach, and it should be considered for the evolution of Big Bang.
Closing this issue, as the final decision from the Big Bang anchors is that this pattern would be too complicated for the average customer to implement/utilize. That doesn't mean all this work by the Tetrate team wasn't amazing, and it remains useful for more advanced users of Big Bang!
Will re-evaluate Istio canary upgrades within Big Bang when the Big Bang Operator is implemented in the (far) future, per the last comment by Daneyon.