Release 2.45.0
Release Engineering
Expectations for Release Engineer
- A regularly scheduled release takes place Tuesday-Friday every other week at the end of a sprint.
- Do not sign up as release engineer if you have any planned time off Tuesday-Friday of the release week.
- Testing on the Dogfood cluster should be complete by end of day Wednesday.
- The release should be mostly complete by end of day Thursday. (No one wants to be troubleshooting release things on a Friday at 5 pm)
- In order to expedite the release process, once a package upgrade is verified to have broken either the upgrade or the package UI functionality, the SOP will be to immediately back out the change for that package using the Package Rejection Process.
- Join the IL4 Mattermost Big Bang Release Engineering Channel. This group chat will allow the Maintainers 1 to support you. This group also provides the Maintainers 1 with visibility on where we are in the release process.
- The RE should seek anchor’s approval prior to starting step 1. Maintainers tend to push through last minute MRs on Tuesday morning. We want to get any finished work into the release wherever possible.
- If a shadow is assigned, hop in a Zoom breakout room and pair to complete the release process.
Shadowing
- Team members should shadow a release engineer prior to becoming RE. Assigned REs should have a solid understanding of the release process prior to their turn and have shadowed enough to understand and be able to execute the release process.
- Try to clarify any questions about the release process steps prior to becoming the actual release engineer.
- Review the release process steps before it's your turn. As always you can reach out to the Maintainers 1 with questions about the release process.
- Sync regularly with the release engineer to understand any issues/blockers that they run into. Likely you'll run into similar issues or blockers.
Prerequisites
Important: Make sure that you have met all prerequisites before the day you're scheduled to start working on a release in case you run into any problems.
- Install or update all the tools listed in this Big Bang Training document
- Install or update R2-D2 according to the instructions in the R2-D2 README
- Install or update gogru according to the instructions in the gogru README
- Request access to and/or ensure you can connect to the Dogfood cluster according to the instructions in the Dogfood README
- Open Default Application Credentials
- Have your P1 username, password and MFA device
Release Process
Foreword
- DO NOT blindly copy and paste commands prepended with $. Carefully read and evaluate these commands and execute them if applicable.
- This upgrade process is applicable only to minor Big Bang version upgrades of Big Bang major version 2. It is subject to change in future major versions. Patch version upgrades follow a different process.
- Use the Gogru release automation tool along with R2D2 to complete the release. Currently Gogru only supports the 1. Release Prep and 2. Upgrade and Debug Cluster parts of the release process. Follow the README for instructions on how to clone and install Gogru.
- Use the R2D2 release automation tool to automate certain parts of this release process. Follow the README for instructions on how to clone and install R2D2. Usage of R2D2 is denoted in the document by the _R2-D2_ prefix.
- The release branch format is as follows: release-<major>.<minor>.x where <minor> is the minor release number. Example: release-2.7.x. The <minor> string is a placeholder for the minor release number in this document.
- The release tag format for major version 2 of Big Bang is: 2.<minor>.0.
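The naming formats above, plus the release candidate tag format used later in this document, can be sketched as small shell helpers. This is a minimal illustration only: the function names and the BB_MAJOR variable are hypothetical and are not part of Gogru or R2-D2.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Gogru or R2-D2.
BB_MAJOR=2  # this process only covers Big Bang major version 2

# Print the release branch name, e.g. release-2.7.x for minor 7.
release_branch() {
  printf 'release-%s.%s.x\n' "$BB_MAJOR" "$1"
}

# Print the release tag name, e.g. 2.7.0 for minor 7.
release_tag() {
  printf '%s.%s.0\n' "$BB_MAJOR" "$1"
}

# Print a release candidate tag name; the candidate number defaults to 0.
release_candidate_tag() {
  printf '%s.%s.0-rc.%s\n' "$BB_MAJOR" "$1" "${2:-0}"
}

release_branch 7   # prints release-2.7.x
```

Deriving all three names from a single minor number can help avoid typos when the same value has to be entered in several places during a release.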
NOTE: As you work through the release make a list of pain points / unclear steps. Once the release is complete, provide this feedback to the maintainers via MM and/or an MR to update the documentation. This allows us to continuously improve this process for future release engineers.
1. Release Prep
- Create the Bigbang (product) Issue
  ./gogru createIssue
- Verify the Dogfood cluster is up and running normally
  ./gogru dogfoodCheck
  - If there are any issues with the Helm Releases or Pods in the output, investigate and fix them before continuing
- Create the release branches for Dogfood and BigBang (product), and verify the hash of the release branch matches the master branch
  ./gogru createRB
- Update BigBang (product) version references with the new release version
  ./gogru update-version-references
2. Upgrade and Debug Cluster
WARNING: Upgrade only, do not delete and redeploy.
- Check and Renew ElasticSearch (ECK) License
  ./gogru eck-license-check
- Check and Renew Mattermost License
  ./gogru mm-license-check
- Upgrade Flux
  ./gogru update-flux
- Upgrade Dogfood
  ./gogru upgrade-dogfood
- After running the upgrade-dogfood command, an MR will be opened in your name. Have the Maintainers 1 merge your branch into master. Once merged, flux should pick up the changes and begin reconciling the helm releases. This may take up to an hour depending on the size of the release.
- As an extra check, you can watch the helmreleases, gitrepositories, pods, and kustomizations to check when all HRs have properly reconciled. Gogru's dogfoodCheck only looks at the HRs and Pods for now.
  watch kubectl get gitrepositories,kustomizations,hr,po -A
IF FLUX HAS NOT UPDATED AFTER 10 MINUTES
Shell instructions to force Flux to reconcile:
```shell
flux reconcile hr -n bigbang bigbang --with-source --force
```
- If flux is still not updating, delete the flux source controller:
  kubectl delete pod -n flux-system -l app=source-controller
- If the helm release shows max retries exhausted, check a describe of the HR. If it shows "another operation (install/upgrade/rollback) is in progress", this is an issue caused by too many Flux reconciles happening at once and typically the Helm controller crashing. You will need to delete helm release secrets and reconcile in flux as follows. Note that ${HR_NAME} is the same HR you described which is in a bad state (typically a specific package and NOT bigbang itself).
  # example w/ kiali
  $ HR_NAME=kiali
  $ kubectl get secrets -n bigbang | grep ${HR_NAME}
  # example output:
  # some hr names are duplicated w/ a dash in the middle, some are not
  sh.helm.release.v1.kiali-kiali.v1 helm.sh/release.v1 1 18h
  sh.helm.release.v1.kiali-kiali.v2 helm.sh/release.v1 1 17h
  sh.helm.release.v1.kiali-kiali.v3 helm.sh/release.v1 1 17m
  # Delete the latest one:
  $ kubectl delete secret -n bigbang sh.helm.release.v1.${HR_NAME}-${HR_NAME}.v3
  # suspend/resume the hr
  $ flux suspend hr -n bigbang ${HR_NAME}
  $ flux resume hr -n bigbang ${HR_NAME}
- If you see errors about no space left when you kubectl describe a failed deployment, the logs have probably filled the filesystem of the node. Determine which node the deployment was scheduled on, ssh to it, and delete the logs. You need to have sshuttle running in order to reach the IP of the nodes.
  ssh -i ~/.ssh/dogfood.pem ec2-user@xx.x.x.xx
  sudo -i
  rm -rf /var/log/containers/*
  rm -rf /var/log/pods/*
  Then the deployment should recover on its own
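When a helm release is stuck with "another operation is in progress", the "latest" secret is the one with the highest revision number, which is not always the last line of the listing (v10 sorts before v3 alphabetically). The following is a minimal sketch of picking it out of the kubectl get secrets output. The function name latest_helm_secret is hypothetical, and the name match is a loose substring match, like the grep in the example, so review the result before deleting anything.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Gogru or R2-D2.
# Reads `kubectl get secrets` output on stdin and prints the name of the
# Helm release secret with the highest revision for the given HR name.
latest_helm_secret() {
  awk -v hr="$1" '
    # Helm names release secrets sh.helm.release.v1.<release>.v<revision>
    $1 ~ /^sh\.helm\.release\.v1\./ && index($1, hr) {
      n = split($1, parts, ".")
      rev = substr(parts[n], 2) + 0   # "v3" -> 3, compared numerically
      if (rev > max) { max = rev; name = $1 }
    }
    END { if (name != "") print name }
  '
}

# Example usage (needs cluster access, so it is left commented out):
# kubectl get secrets -n bigbang | latest_helm_secret "${HR_NAME}"
```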
- Verify cluster has updated to the new release and verify that there are no outstanding issues with Pods or Helm Releases
  ./gogru dogfoodCheck
NOTE: If Prometheus does not start and is logging an error relating to Vault permissions, follow these instructions to reissue the Vault token.
NOTE: If for some reason the upgrade fails catastrophically and the release needs to be rolled back, follow the rollback instructions. This should be used only as a last resort to re-test the upgrade process.
HANDLING PACKAGE UPGRADE FAILURES
Follow the Package Rejection Process to roll back the package upgrade
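As a manual complement to dogfoodCheck, the Helm Releases that have not reconciled can be filtered out of the kubectl get hr -A listing. This is a minimal sketch, not how Gogru performs its check. The function name not_ready_hrs is hypothetical, and it assumes the listing has a header row containing NAMESPACE, NAME and READY columns.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Gogru or R2-D2.
# Reads `kubectl get hr -A` output on stdin and prints namespace/name for
# every Helm Release whose READY column is anything other than True.
not_ready_hrs() {
  awk '
    NR == 1 {
      # Locate columns by header name instead of assuming fixed positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("NAMESPACE" in col) || !("NAME" in col) || !("READY" in col)) exit 2
      next
    }
    $col["READY"] != "True" { print $col["NAMESPACE"] "/" $col["NAME"] }
  '
}

# Example usage (needs cluster access, so it is left commented out):
# kubectl get hr -A | not_ready_hrs
```

An empty result means every listed Helm Release reports Ready. Pods still need to be checked separately.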
3. UI Tests
WARNING: if you encounter a package that is failing please follow the Package Rejection Process to rollback the package upgrade.
Default Application Credentials
Logging
- Login to kibana with SSO.
  NOTE: If you get "You do not have permission to access the requested page," follow the instructions under "Renew ECK Trial" in the 2. Upgrade and Debug Cluster section above. Don't forget to log in as admin and reconfigure role mapping after you run the kubectl and flux commands.
- Verify that Kibana is actively indexing/logging.
  - To do this, click on the drop-down menu on the upper-left corner, then under "Analytics" click Discover. Click "Create data view." In the "Index pattern" field, enter jaeger*. Set the "Timestamp field" to "I don't want to use the time filter", then click "Use without saving". Log data should populate.
  - If it is not indexing (you see no index for jaeger), you may have renewed the ElasticSearch license and jaeger did not reconcile after the renewal. Force jaeger to pick up the new configuration by doing:
    flux reconcile hr -n bigbang jaeger --force --with-source
Kiali
- Login to kiali with SSO
- Validate graphs and traces are visible under applications/workloads
  - Graphs at the overview
  - Graphs for inbound metrics
  - Graphs for outbound metrics
  - Graphs for Traces
- Validate no errors appear
  - NOTE: Red notification bell would be visible if there are errors. Errors on individual application listings for labels, etc are expected and OK.
Twistlock
- Log in here
  - In the Targets drop down choose serviceMonitor/twistlock/twistlock-console/0
  - Click the small "show more" button that appears. You should see a State of UP and no errors.
- R2-D2: Twistlock Test
Fortify
- R2-D2: Fortify Test
GitLab Sonarqube Gitlab Runner
NOTE: If gitlab has updated to v18 and pipelines are not running, follow the Runner Registration procedure to register the runner. Right now this process works, but v18 gets rid of that bypass capability and we will have to follow a new process like the one here.
- Login to gitlab with SSO
- Edit profile and change user avatar. If your avatar doesn't change right away, check again later. It can take several minutes or more.
- Go to Security > Access Token and create a personal access token. Select all scopes. Record the token. You'll need it for the next two steps.
- R2-D2: Gitlab Test
Sonarqube
- Login to sonarqube with SSO
- Verify your Project says PASSED
Nexus
- R2-D2: Nexus Test
Minio
- R2-D2: Minio Test
Mattermost
- R2-D2: Mattermost Test
ArgoCD
- R2-D2: ArgoCD Test
Kyverno
- R2-D2: Kyverno Test
Velero
- R2-D2: Velero Test
- NOTE: This test fails occasionally. You may need to run it multiple times to get it to pass.
Loki
- R2-D2: Loki Test
- NOTE: the Loki / Operational dashboard is currently missing and we have a task here to add it back in
Tempo
- R2-D2: Tempo Test
- NOTE: this test is currently obsolete as we do not have any data tethered to tempo
Keycloak
- R2-D2: Keycloak Test
Neuvector
- R2-D2: Neuvector Test
Grafana & ArgoCD & Anchore & Mattermost SSO
- R2-D2: Native SSO Grafana, ArgoCD, Anchore and Mattermost Test
NOTE: Grafana OPA Violations dashboard showing "No Data" is a known issue. This has been skipped since 2.18.0, current open issue awaiting fix is here.
Monitoring & Cluster Auditor & Tracing (Jaeger) & Alertmanager
- R2-D2: authservice SSO Prometheus, Tracing, AlertManager, and Cluster Auditor Test
- NOTE: Grafana OPA Violations dashboard showing "No Data" is a known issue. This has been skipped since 2.18.0, current open issue awaiting fix is here.
- NOTE: if vault shows up as UNHEALTHY, that means it was rebuilt and we need to update the permissions on the vault/secrets/token file within the prometheus pod. (we have an issue with the permissions on the file.)
  - to fix this:
    - exec into the vault-agent container on the prometheus-monitoring-monitoring-kube-prometheus-0 pod and run the following
      chmod 644 /vault/secrets/token
Anchore
- log back in as the admin user - password is in anchore-anchore-enterprise secret (admin will have pull credentials set up for the registries):
  kubectl get secret anchore-anchore-enterprise -n anchore -o json | jq -r '.data.ANCHORE_ADMIN_PASSWORD' | base64 -d; echo ' <- password'
- Scan image in dogfood registry, registry.dogfood.bigbang.mil/GROUPNAMEHERE/PROJECTNAMEHERE/alpine:latest
  - Go to Images Page, Analyze Tag
    - Registry: registry.dogfood.bigbang.mil
    - Repository: GROUPNAME/PROJECTNAME/alpine
    - Tag: latest
- Scan image in nexus registry, containers.dogfood.bigbang.mil/alpine:<release-number> (use your release number, ex: 1-XX-0). If authentication fails that means that the Nexus credentials have changed. Retrieve the Nexus credentials using instructions from Nexus above. Update creds in Anchore by clicking on menu Images > Analyze Repository > link at bottom "clicking here" > click on registry name > then edit the credentials.
- Validate scans complete and Anchore displays data (click the SHA value for each image)
4. Create Release
STOP: Before finalizing the release, confirm that the latest bb-docs-compiler pipeline passed. If not, identify, correct, and cherry-pick a fix into the release branch. This will prevent wasting time re-running pipelines if the docs compiler pipeline fails.
Prep
- Re-run helm docs to ensure that the docs are on the latest version + latest package versions.
  cd bigbang
  git pull # pull any last minute cherry picks, verify nothing has greatly changed
  git checkout release-2.<minor>.x
  docker run -v "$(pwd):/helm-docs" -u $(id -u) jnorwood/helm-docs:v1.5.0 -s file -t .gitlab/base_config.md.gotmpl --dry-run > ./docs/understanding-bigbang/configuration/base-config.md
  # commit and push the changes (if any)
Build The Release Notes
- R2-D2: run with the Build release notes option selected
- Clean up release notes
  - Check to see if any packages have been added or graduated from BETA. There is no single resource that lists this information, but you should generally be aware of packages that have been added or moved out of BETA from discussions with other team members during your workdays. You can also ask the Maintainers 1 if unsure. If any packages have been added or moved out of BETA, adjust your local R2D2 config as needed. Also open an MR for R2D2 with the change to the default config.
  - Create Upgrade Notices, based off of the listed notices in the release issue. Also review with maintainers to see if any other major changes should be noted.
  - Move MRs from 'Big Bang MRs' into their specific sections below. If an MR is for Big Bang (docs, CI, etc) and not a specific package, it can remain in place under 'Big Bang MRs'. General rule of thumb: if the git tag changes for a package, put the MR under that package; if there is no git tag change for a package, put it under BB.
  - Adjust known issues as needed: If an issue has been resolved it can be removed and if any new issues were discovered they should be added.
  - Verify table contains all chart upgrades mentioned in the MR list below.
  - Verify table contains all application versions. Compare with previous release notes. If a package was upgraded there will usually be a bump in application version and chart version in the table.
  - Verify all internal comments are removed from Release Notes, such as comments directing the release engineer to copy/move things around
  - Any issues encountered during release upgrade testing need to be thoroughly documented in the release notes. If you had to delete a replica set, force restart some pods, or delete a PV, please make sure these steps are included in the known issues section of the release notes.
- Scroll through every package listed and verify the following
  - it has a link to at least one MR
  - the CHANGELOG entries match the MR listed
  - if more than one MR or CHANGELOG entry exists for a package, verify it makes sense (i.e. if there are 2 MRs, there are CHANGELOG entries for both). Conversely, if there is more than one CHANGELOG entry, make sure those entries are tied to the MRs in this release. Sometimes changelogs get messed up, causing more entries in the release notes than necessary.
- Publish Release Notes to Dogfood
  - Commit the resulting release notes and push to <release-branch> in Big Bang (dogfood)
  - Create an MR for the Maintainers 1 and have it merged before continuing on
Create Release Candidate Tag
- Create release candidate tag based on release branch, i.e. 2.<minor>.0-rc.0. Tagging will additionally create release artifacts and the pipeline runs north of 1 hour. You will need to request tag permissions from the Maintainers 1.
  - To do this via the UI (generally preferred): tags page -> new tag, name: 2.<minor>.0-rc.0, create from: release-2.<minor>.x, message: "release candidate", release notes: leave empty
  - To do this via git CLI:
    git tag -a 2.<minor>.0-rc.0 -m "release candidate"
    # list the tags to make sure you made the correct one
    git tag -n
    git push origin 2.<minor>.0-rc.0
- Passed pipeline for Release Candidate tag.
- Passed pipeline in bb-docs-compiler for latest RC tag (gets created/scheduled towards the end of the bigbang release pipeline).
  Review all pipeline output looking for failures or warnings. Reach out to the maintainers for a quick review before proceeding. Maintainers 1
- Check the releases for release notes.
Create Release Tag
- Create release tag based on release branch, i.e. 2.<minor>.0.
  - To do this via the UI (generally preferred): tags page -> new tag, name: 2.<minor>.0, create from: release-2.<minor>.x, message: release 2.<minor>.0, release notes: leave empty
  - To do this via git CLI:
    git tag -a 2.<minor>.0 -m "release 2.<minor>.0"
    # list the tags to make sure you made the correct one
    git tag -n
    # push
    git push origin 2.<minor>.0
- Passed release pipeline.
  Review all pipeline output looking for failures or warnings. Reach out to the maintainers for a quick review before proceeding. Maintainers 1
- Ensure that the RC tag release created prior to this was deleted as well.
- Ensure Release Notes are present here
- Create a reminder to ensure that release docs have compiled and appear on Big Bang Docs and they have been mirrored to the IL2 Big Bang repository the following day.
  - Expand the drop down in the title bar showing latest and confirm the current release is present.
  - Click around in the docs and make sure docs are properly formatted, pay special attention to lists and tables.
  NOTE: BB-docs won't actually publish the latest release version until the nightly bb-docs job runs. You can check this on the following day.
  NOTE: You may require access be granted to the IL2 Big Bang repository to view the status. Contact the Maintainers 1 for assistance.
RELEASE IT
- Type up a release announcement for the Maintainers 1 to post in the Big Bang Value Stream channel. Mention any breaking changes and some of the key updates from this release. Use same format as a prior announcement https://chat.il2.dso.mil/platform-one/channels/team---big-bang
- Cherry-pick release commit(s) as needed with merge request back to master branch. We do not ever merge release branch back to master. Instead, make a separate branch from master and cherry-pick the release commit(s) into it. Use the resulting MR to close the release issue.
  # Navigate to your local clone of the BB repo
  cd path/to/my/clone/of/bigbang
  # Get the most up to date master
  git checkout master
  git pull
  # Check out new branch for cherrypicking
  git checkout -b 2.<minor>-cherrypick
  # Repeat the following for whichever commits are required from release
  # Typically this is the initial release commit (bumped GitRepo, Chart.yaml, CHANGELOG, README) and a final commit that re-ran helm-docs (if needed)
  git cherry-pick <commit sha for Nth commit>
  git push -u origin 2.<minor>-cherrypick
  # Create an MR using Gitlab, merging this branch back to master, reach out to maintainers to review
- Close Big Bang Milestone in GitLab. (make sure that there are n+2 milestones still, the current one and 1 future milestone, if not contact the maintainers 1)
- Handoff the release to the Maintainers 1, they will review then celebrate and announce the release in the public MM channel
- Reach out to @marknelson to let him know that the release is done and to update the Gitlab Banner