Review https://istio.io/latest/docs/setup/upgrade/canary/ and identify if Big Bang can either document or provide a method to upgrade Istio using a canary deployment. Based on the investigation, create additional issues to implement a solution.
While testing the introduction of a canary field to istiooperator, I did not get the expected outcome. I create an istiooperator as usual:
```yaml
istiooperator:
  enabled: true
  ...
```
Flux creates the gitrepo, hr, etc. The operator runs as expected and the hr/operator surfaces a "Ready=True" status.
I then simulate a canary upgrade by updating istiooperator to include a canary:
```yaml
istiooperator:
  ...
  canary:
    revision: "1-10-3"
  ...
```
Flux creates the gitrepo, hr, etc. for the "1-10-3" operator. Both operators run as expected and both hrs surface a "Ready=True" status.
The final step in the canary upgrade is to remove the initial operator. This is accomplished by removing the canary field and specifying the same revision, repo, tag, etc. in istiooperator as was specified in canary:
```yaml
istiooperator:
  revision: "1-10-3"
  ...
```
This causes Flux to update the initial operator and remove the canary operator instead of removing the initial operator and using the new "1-10-3" operator. For additional details see the following:
As we discussed during our 8/18 meeting, I think a map[string]{} is an approach that will provide the desired outcome. A user will create istiooperators as such:
```yaml
istiooperators:
  default: # This name is appended to Istio Operator resources.
    enabled: true
    ...
```
A second operator is then created with a revision to perform the canary upgrade:
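A minimal sketch of what that second entry might look like under this proposal (the key name and revision value are illustrative, not the final API):

```yaml
istiooperators:
  default: # existing operator, continues to run during the upgrade
    enabled: true
  canary: # second operator created for the canary upgrade
    enabled: true
    revision: "1-10-3"
```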
I have successfully tested the map[string]{} approach. I create a named instance of istiooperators:
```yaml
istiooperators:
  default:
    enabled: true
    git:
      ...
```
Note: The "default" istiooperator is disabled by default. This allows the definitions for the "default" operator to be completely removed from the customer template post canary upgrade.
Flux creates the gitrepo, hr, etc. The operator runs as expected and the hr/operator surfaces a "Ready=True" status.
I then simulate a canary upgrade by adding another operator to istiooperators:
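For reference, a sketch of the values at this point, assuming the second entry mirrors the default one (key name, revision, and tags are illustrative):

```yaml
istiooperators:
  default:
    enabled: true
    git:
      tag: 1.9.7-bb.0 # illustrative tag for the existing operator
  canary:
    enabled: true
    revision: "1-10-3"
    git:
      tag: 1.10.3-bb.0 # illustrative tag for the new operator
```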
Flux creates the gitrepo, hr, etc. for the "canary" operator, appending "1-10-3" to operator resource names. If revision is undefined for the "canary" operator, the operator name is appended instead. Both operators run as expected and both hrs surface a "Ready=True" status.
The final step in the canary upgrade is to remove the "default" operator:
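A sketch of the values after this step, assuming the default entry is deleted from the template (or left disabled, per the note above) and only the canary operator remains:

```yaml
istiooperators:
  # default entry removed (it is disabled by default, so its definition can be dropped entirely)
  canary:
    enabled: true
    revision: "1-10-3"
```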
We shouldn't have canary then be the default/deployed instance of Istio; that'll be confusing to people and won't let them get the upgrades we plan for.
What's the issue with having the default object roll to the proposed (canary) version when that is accepted by the end user?
```yaml
istio:
  git:
    tag: 1.10.3 # sets to revision in the `istio` HR. Manually set to current version before the upgrade
  canary:
    enabled: true
    git:
      tag: 1.10.4 # ports to the revision `istio-canary`
```
At this point the admins can verify functionality and when they're ready:
```yaml
istio:
  canary:
    enabled: false # would be default, but shown here to be explicit
```
This would then roll the istio HR to the new/proposed revision?
> We shouldn't have canary then be the default/deployed instance of Istio; that'll be confusing to people and won't let them get the upgrades we plan for.
@runyontr default is the default Istio Operator for the map[string]{} approach, not canary. canary is a unique key in the istiooperators map that represents an Istio Operator.
> This would then roll the istio HR to the new/proposed revision?
@runyontr the canary approach does not work as intended during the last step of the canary upgrade (removing the initial Istio Operator). Let's walk through an upgrade using this approach.
Flux creates the 1.10.3 Istio Operator repo, hr, etc. and all works as usual. BB is now running two Istio Operators, the initial operator, i.e. 1.9.7, and the new 1.10.3 operator. Note: The existing Istio Operator resources are unmodified by Flux.
The user upgrades the Istio data-plane, i.e. sets the revision label, bounces workloads, etc., and now it's time to uninstall the initial Istio Operator. The user updates the template to transition the canary values to istiooperator and remove the canary, for example:
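A sketch of that template update under the canary-field approach (the revision and tag values are illustrative):

```yaml
istiooperator:
  revision: "1-10-3"
  git:
    tag: 1.10.3-bb.0
  # the canary block from the previous step has been removed
```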
This causes Flux to remove the canary Istio Operator and update the initial Istio Operator to 1.10.3, which is not the desired behavior. We want to keep the canary Istio Operator running and uninstall the initial Istio Operator. A potential workaround for this issue is to disable the initial Istio Operator and keep the canary Istio Operator running, for example:
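A sketch of that workaround, assuming canary gains its own enabled field (see the caveat below):

```yaml
istiooperator:
  enabled: false # initial operator disabled instead of removed
  canary:
    enabled: true
    revision: "1-10-3"
```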
Even if canary supports enabled: true/false, this results in a degraded user experience.
How are subsequent upgrades handled? Once canary is the only Istio Operator, is the primary istiooperator then used to perform the next canary upgrade? Again, this provides a degraded user experience.
> This causes Flux to update the initial operator and remove the canary operator instead of removing the initial operator and using the new "1-10-3" operator.
@runyontr IMO it's a problem for the following reasons:
It's not a true canary upgrade. A canary creates a new instance of X, let's say X2, while X continues to serve production traffic. Traffic is incrementally shifted over to X2. After all traffic is shifted to X2, X is removed.
It would differ from the process documented upstream.
@runyontr thank you for the additional insight into the approach you're considering. The approach you provided has an issue during the remove-canary step. While both old and new versions are running, the canary should not be removed since it's now managing the data-plane proxies. A potential solution is to update the old operator with the new revision, image ref, etc. to match the canary and then remove the canary. This would mean the cluster has 2 active control-planes running the same version/revision until the canary is removed. Although I have not tested this, I anticipate issues and expect this not to be supported upstream.
Note that enabled: true is shown here for clarity and is unneeded since these pkgs are enabled by default.
A user upgrades BB to a version that supports Istio canary upgrades. The istio/istiooperator fields exist for backwards compatibility. Assuming the new BB version bumps the default Istio image tags and the user is fine with an in-place Istio upgrade, the istio/istiooperator fields from step 1 are left unchanged. If the user prefers to canary upgrade, then a tag should be used to specify the previous pkg versions used in step 1 before bumping BB. For example:
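A sketch of that pinning, assuming 1.9.7 was the release deployed in step 1 (the -bb.0 tag suffix is illustrative):

```yaml
istio:
  git:
    tag: 1.9.7-bb.0 # pin the pre-upgrade Istio release
istiooperator:
  git:
    tag: 1.9.7-bb.0 # pin the pre-upgrade operator release
```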
Now that BB has been upgraded and the git tag has been added to ensure Istio stays at the previous release, another istio/istiooperator can be created using the new istios and istiooperators fields that specify the new Istio release supported by BB:
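A sketch of the new map entries, assuming the instance keys follow the 1-10-3 naming used later in this walkthrough:

```yaml
istios:
  1-10-3:
    enabled: true
    git:
      tag: 1.10.3-bb.0
istiooperators:
  1-10-3:
    enabled: true
    git:
      tag: 1.10.3-bb.0
```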
Note that steps 2 and 3 can be combined so the 1-10-3 instances of istio and istiooperator are created during the BB upgrade. The BB cluster now has 2 Istio control-planes. All proxies except gateways are still using the 1.9.7 Istio control-plane.
The user upgrades the Istio data-plane by rolling the existing workload deployments, i.e. kubectl rollout restart deployment -n $NS_NAME. After all proxies are using the 1.10.3 Istio control-plane, the 1.9.7 control-plane and operator can be removed, for example:
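A sketch of that removal, assuming the deprecated top-level fields are simply disabled (this matches the expectation discussed further down in the thread):

```yaml
istio:
  enabled: false # removes the 1.9.7 control-plane
istiooperator:
  enabled: false # removes the 1.9.7 operator
```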
Fast forward to the future and a new release of BB is available. The user will upgrade BB but istio/istiooperator will not get bumped since tag: 1.10.3-bb.0 is specified for both. After the user ensures the rest of BB is working as expected using the new release, it's time to start the Istio canary upgrade. The user creates a new instance of both that reference the Istio version supported by the newly installed BB release, for example:
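For example, a sketch in which 1.11.0 stands in for whatever Istio version the new BB release supports:

```yaml
istios:
  1-10-3:
    enabled: true
    git:
      tag: 1.10.3-bb.0 # pinned, so the BB upgrade does not bump it
  1-11-0:
    enabled: true
    git:
      tag: 1.11.0-bb.0 # illustrative tag for the new release
istiooperators:
  1-10-3:
    enabled: true
    git:
      tag: 1.10.3-bb.0
  1-11-0:
    enabled: true
    git:
      tag: 1.11.0-bb.0
```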
This causes a new 1.11.0 Istio operator and control-plane to be created. The BB cluster now has 2 Istio operators and control-planes. At this point all proxies except gateways are still using the 1.10.3 control-plane. After rolling the workload deployments onto the 1.11.0 control-plane and verifying e2e functionality, the 1.10.3 operator and control-plane can be removed. This is accomplished by removing the 1-10-3 instances of the operator and control-plane:
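A sketch of the values after the old instances are dropped; only the 1-11-0 entries remain:

```yaml
istios:
  # 1-10-3 entry removed
  1-11-0:
    enabled: true
    git:
      tag: 1.11.0-bb.0
istiooperators:
  # 1-10-3 entry removed
  1-11-0:
    enabled: true
    git:
      tag: 1.11.0-bb.0
```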
From step 4 and beyond, is the expectation that the values are:
```yaml
istio:
  enabled: false
istiooperator:
  enabled: false
```
This will cause some complexity issues as we try and determine if istio is enabled in the cluster or not to let the other packages know.
Upgrades:
After step 4 occurs, it's not clear how an end user would attempt an Istio upgrade without first reading the Release notes and manually copying the new region of values for the upgraded Istio version. I think this creates opportunities for people to either:
> From step 4 and beyond, is the expectation that the values are:
> ```yaml
> istio:
>   enabled: false
> istiooperator:
>   enabled: false
> ```
@runyontr yes, this is the expectation while istio and istiooperator are being deprecated. As part of the deprecation plan, we could set enabled: false by default for istio and istiooperator to address this issue. What is the BB policy for updating default values?
> This will cause some complexity issues as we try and determine if istio is enabled in the cluster or not to let the other packages know.
Yes, additional code complexity is required to support backwards compatibility, but that should be expected. I agree that we must reduce the user-facing complexity as much as possible. The backwards compatibility code and the istio/istiooperator fields should be removed when deprecation is complete.
> After step 4 occurs, it's not clear how an end user would attempt an Istio upgrade without first reading the Release notes and manually copying the new region of values for the upgraded Istio version.
An upgrade guide is required for this issue to be considered feature complete. Performing Istio canary upgrades will be more complex for users due to the API changes, i.e. istio > istios. Since the new API types are similar to the existing types, my hope is that users will quickly adjust their mental model and that future upgrades will be less complex post deprecation.
@runyontr @michaelmcleroy I successfully completed a Flux-driven Istio 1.9.7 > 1.10.3 upgrade using the map[string]{} approach. I did uncover an issue that requires discussion. During the Istio control-plane upgrade, the Istio ingressgateway is upgraded in-place, which is expected and consistent with the upstream process. This means attached Gateway resources are now using the new version, which again is consistent with the upstream process. However, from a Flux perspective a Gateway is associated with a particular HelmRelease. If two HelmReleases specify the same Gateway, the HelmRelease that currently manages the Gateway deletes it, and the Gateway is then created by the other HelmRelease. This Gateway "recreate" causes services exposed by the Gateway to be unavailable during the delete/create, xDS sync, etc. In my testing this is typically around 30 seconds, but still not ideal.
From my understanding, there is no way to share intent across HelmReleases. One potential approach to work around this limitation is to abstract the gateway from the Istio HelmRelease into its own HelmRelease. Let me know your thoughts when you have a moment. In the meantime, I'll start working through the backwards compatibility details.