Release 2.43.0
Release Engineering
Expectations for Release Engineer
- A regularly scheduled release takes place Tuesday-Friday every other week at the end of a sprint.
- Do not sign up as release engineer if you have any planned time off Tuesday-Friday of the release week.
- Testing on the Dogfood cluster should be complete by end of day Wednesday.
- The release should be mostly complete by end of day Thursday. (No one wants to be troubleshooting release things on a Friday at 5 pm)
- Join the IL4 Mattermost Big Bang Release Engineering channel. This group chat allows the Maintainers [^maintainers] to support you and gives them visibility into where we are in the release process.
- The RE should seek the anchor's approval prior to starting step 1. Maintainers tend to push through last-minute MRs on Tuesday morning, and we want to get any finished work into the release wherever possible.
- If a shadow is assigned, hop in a Zoom breakout room and pair to complete the release process.
Shadowing
- Team members should shadow a release engineer prior to becoming an RE. Assigned REs should have a solid understanding of the release process and have shadowed enough to be able to execute it on their own.
- Try to clarify any questions about the release process steps before serving as the actual release engineer.
- Review the release process steps before it's your turn. As always, you can reach out to the Maintainers [^maintainers] with questions about the release process.
- Sync regularly with the release engineer to understand any issues/blockers that they run into. You'll likely run into similar issues or blockers.
Prerequisites
Important: Make sure that you have met all prerequisites before the day you're scheduled to start working on a release in case you run into any problems.
- Install or update all the tools listed in this Big Bang Training document
- Install or update R2-D2 according to the instructions in the R2-D2 README
- Install or update gogru according to the instructions in the gogru README
- Request access to and/or ensure you can connect to the Dogfood cluster according to the instructions in the Dogfood README
- Have your P1 username, password, and MFA device ready
Release Process
Foreword
- DO NOT blindly copy and paste commands prepended with `$`. Carefully read and evaluate these commands, and execute them only if applicable.
- This upgrade process applies only to minor version upgrades within Big Bang major version `2`. It is subject to change in future major versions. Patch version upgrades follow a different process.
- Use the Gogru release automation tool along with R2D2 to complete the release. Currently Gogru only supports the "1. Release Prep" part of the release process. Follow the README for instructions on how to clone and install Gogru.
- Use the R2D2 release automation tool to automate certain parts of this release process. Follow the README for instructions on how to clone and install R2D2. Usage of R2D2 is denoted in this document by the _R2-D2_ prefix.
- The release branch format is `release-<major>.<minor>.x`, where `<minor>` is the minor release number. Example: `release-2.7.x`. The `<minor>` string is a placeholder for the minor release number throughout this document.
- The release tag format for major version 2 of Big Bang is `2.<minor>.0`.
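The naming convention above can be sketched in shell; `MINOR=7` is a placeholder example value, not part of the actual process:

```shell
# Example only: derive the release branch and tag names from a minor version number.
MINOR=7                               # placeholder minor release number
RELEASE_BRANCH="release-2.${MINOR}.x"
RELEASE_TAG="2.${MINOR}.0"
echo "${RELEASE_BRANCH} -> ${RELEASE_TAG}"   # prints: release-2.7.x -> 2.7.0
```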
NOTE: As you work through the release make a list of pain points / unclear steps. Once the release is complete, provide this feedback to the maintainers via MM and/or an MR to update the documentation. This allows us to continuously improve this process for future release engineers.
1. Release Prep
📌 NOTE: If you need the manual steps and/or R2D2 steps, expand them at the end of this section.
- Create the Big Bang (product) issue:
  ./gogru createIssue
- Verify the Dogfood cluster is up and running normally:
  ./gogru dogfoodCheck
  - If there are any issues with the Helm Releases or Pods in the output, investigate and fix them before continuing.
- Create the release branches for Dogfood and Big Bang (product), and verify the hash of the release branch matches the master branch:
  ./gogru createRB
- Update Big Bang (product) version references with the new release version:
  ./gogru update-version-references
R2D2 and Manual Steps (skip if using gogru commands)
- Check the Big Bang issues to see if a release issue has already been created. If none exists, create one with the following values:

  | Field | Value |
  | --- | --- |
  | Name | Release 2.<minor>.<patch> |
  | Description | copy/paste the Release Process section of this document |
  | Assignee | $YOUR_NAME |
  | Labels | kind::chore, priority::2, status::doing, team::Tools & Automation |
  | Weight | 5 |
  | Iteration | $CURRENT_ITERATION |
- Verify the dogfood cluster is up and running normally.
  - Connect to the dogfood cluster:
    # Modify ~/.ssh/id_rsa to the path for your SSH key
    sshuttle --dns -vr ec2-user@$(aws ec2 describe-instances --filters Name=tag:Name,Values=dogfood2-bastion --output json | jq -r '.Reservations[0].Instances[0].PublicIpAddress') 192.168.28.0/24 --ssh-cmd 'ssh -i ~/.ssh/id_rsa'
  - Run `kubectl get hr -A` and verify all helm releases show `Ready True`.
    - If any helm releases are not showing `Ready True`, investigate and fix the issues before continuing with the release.
- Make sure you have both Big Bang (product) and Big Bang (dogfood) checked out and updated.
- Create the release branch in Big Bang (product). This is done after the mid-sprint check-in on Tuesday to ensure all required MRs for the release have been merged into Big Bang. You will need to request permissions from the Maintainers [^maintainers] if you run into permissions issues. NOTE: the R2D2 tool will not show you an error if you do not have permission.
  - R2-D2: Run `r2d2` in your shell inside the Big Bang (product) repository and select the `Create release branch` option.
  Manual Steps
  - In either the GitLab UI or the Git CLI, create a new branch from `master` with the name `release-2.<minor>.x`.
    # CLI example, replace `<release-branch>` with the name of the release branch as specified above
    $ cd bigbang
    $ git checkout master
    $ git pull
    $ git checkout -b <release-branch>
    $ git branch --show-current
    $ git push -u origin <release-branch>
- Check last release SHAs.
  - Verify that the previous release branch commit hash matches the last release tag hash. Investigate with the previous release engineer if they do not match.
  - R2-D2: Check last release SHAs
  Manual Steps
  - Go to Branches in Big Bang (product).
  - Find the latest release branch. It should be in the format release-<major_version>.<minor_version>.x (e.g. release-2.7.x). You will see an eight-character commit hash below it (e.g. 6c746edd).
  - Go to Tags in a separate tab. Here you will see the latest release tag listed first. Verify that this tag has the same hash you found in the previous step.
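If you prefer the CLI over the GitLab UI, the same hash check can be scripted with git. This is a sketch, assuming an up-to-date `bigbang` clone with tags fetched, and using `release-2.7.x` / `2.7.0` as hypothetical example names:

```shell
# Compare the previous release branch tip against the release tag commit.
git fetch origin --tags
BRANCH_SHA=$(git rev-parse --short=8 origin/release-2.7.x)
TAG_SHA=$(git rev-parse --short=8 '2.7.0^{commit}')
if [ "$BRANCH_SHA" = "$TAG_SHA" ]; then
  echo "branch and tag match: $BRANCH_SHA"
else
  echo "MISMATCH: branch=$BRANCH_SHA tag=$TAG_SHA"
fi
```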
- Create the release branch in Big Bang (dogfood).
  # CLI example, replace `<release-branch>` with the name of the release branch as specified above
  $ cd dogfood-bigbang
  $ git checkout master
  $ git pull
  $ git checkout -b <release-branch>
- Upgrade Big Bang (product) version references.
  - R2-D2: Upgrade Version references
  Manual Steps
  💡 Tip: Make the following changes in a single commit so that they can be cherry-picked into master later. All changes should be made in the Big Bang (product) repository.
  - In `base/gitrepository.yaml`, update ref.tag to your current release version.
  - Update the chart release version in `chart/Chart.yaml`.
  - Add a new changelog entry for the release, e.g.:
    ## [2.<minor>.0] - [!2.<minor>.0](https://repo1.dso.mil/big-bang/bigbang/-/merge_requests?scope=all&utf8=%E2%9C%93&state=merged&milestone_title=2.<minor>.0); List of merge requests in this release.
    <!-- Note: milestone_title=2.<minor>.0 version must match the given minor release version -->
  - Update `/docs/understanding-bigbang/configuration/base-config.md` using `helm-docs`:
    # example release 2.<minor>.x
    $ cd bigbang
    $ git checkout release-2.<minor>.x
    $ docker run -v "$(pwd):/helm-docs" -u $(id -u) jnorwood/helm-docs:v1.5.0 -s file -t .gitlab/base_config.md.gotmpl --dry-run > ./docs/understanding-bigbang/configuration/base-config.md
  - Update docs/packages.md on your release branch in Big Bang: add any new packages, and review whether any columns need updates (mTLS STRICT is a common change; follow the other examples where STRICT is noted). Also make sure to remove the BETA badge from any packages that have moved out of BETA.
  - Commit the changes:
    git commit -am 'version updates for release <release-branch-name>'
  - Push the changes (`git push`).
- Reach out to the Maintainers [^maintainers] to review your commits to the release branch.
2. Upgrade and Debug Cluster
⚠️ WARNING: Upgrade only, do not delete and redeploy.
- Check and renew the Elasticsearch (ECK) license:
  ./gogru eck-license-check
- Check and renew the Mattermost license:
  ./gogru mm-license-check
- Upgrade Flux:
  ./gogru update-flux
- Upgrade Dogfood:
  ./gogru upgrade-dogfood
After running the `upgrade-dogfood` command, an MR will be opened in your name. Have the Maintainers [^maintainers] merge your branch into `master`. Once merged, `flux` should pick up the changes and begin reconciling the helm releases. This may take up to an hour depending on the size of the release.
As an extra check, you can watch the release helmreleases, gitrepositories, pods, and kustomizations to see when all HRs have properly reconciled. Gogru's dogfoodCheck only looks at the HRs and Pods for now.
watch kubectl get gitrepositories,kustomizations,hr,po -A
If flux has not updated after ten minutes:
flux reconcile hr -n bigbang bigbang --with-source --force
- If flux is still not updating, delete the flux source controller:
kubectl delete pod -n flux-system -l app=source-controller
- If the helm release shows max retries exhausted, describe the HR. If it shows "another operation (install/upgrade/rollback) is in progress", this is an issue caused by too many Flux reconciles happening at once, typically crashing the Helm controller. You will need to delete the helm release secrets and reconcile in flux as follows. Note that ${HR_NAME} is the HR you described that is in a bad state (typically a specific package, NOT bigbang itself).
# example w/ kiali
$ HR_NAME=kiali
$ kubectl get secrets -n bigbang | grep ${HR_NAME}
# example output:
# some hr names are duplicated w/ a dash in the middle, some are not
sh.helm.release.v1.kiali-kiali.v1 helm.sh/release.v1 1 18h
sh.helm.release.v1.kiali-kiali.v2 helm.sh/release.v1 1 17h
sh.helm.release.v1.kiali-kiali.v3 helm.sh/release.v1 1 17m
# Delete the latest one:
$ kubectl delete secret -n bigbang sh.helm.release.v1.${HR_NAME}-${HR_NAME}.v3
# suspend/resume the hr
$ flux suspend hr -n bigbang ${HR_NAME}
$ flux resume hr -n bigbang ${HR_NAME}
- If you see "no space left" errors when you `kubectl describe` a failed deployment, the logs have probably filled the node's filesystem. Determine which node the deployment was scheduled on, ssh to it, and delete the logs. You need sshuttle running in order to reach the nodes' IPs.
ssh -i ~/.ssh/dogfood.pem ec2-user@xx.x.x.xx
sudo -i
rm -rf /var/log/containers/*
rm -rf /var/log/pods/*
Then the deployment should recover on its own.
Verify cluster has updated to the new release
- Run `./gogru dogfoodCheck` and verify that there are no outstanding issues with Pods or Helm Releases.
- If Prometheus does not start and is logging an error relating to Vault permissions, follow these instructions to reissue the Vault token.
Old R2D2 Steps (skip if using gogru commands)
- Connect to the dogfood cluster:
  # Modify ~/.ssh/id_rsa to the path for your SSH key
  sshuttle --dns -vr ec2-user@$(aws ec2 describe-instances --filters Name=tag:Name,Values=dogfood2-bastion --output json | jq -r '.Reservations[0].Instances[0].PublicIpAddress') 192.168.28.0/24 --ssh-cmd 'ssh -i ~/.ssh/id_rsa'
📌 NOTE: If you have issues with the AWS CLI commands, adding via the AWS web console is another option. Reach out to the Maintainers [^maintainers] for assistance.
Default Application Credentials
- Review Elasticsearch health and trial license status:
  - The URL to do this is https://kibana.dogfood.bigbang.mil/login?next=%2F. Log on with the information for Logging (Kibana).
  - BUG: Could not find this in the UI per the docs (Click Hamburger Menu > Stack Management > License Management).
  - Workaround: Validate via the API. Click Hamburger Menu > Dev Tools, enter `GET _license` into the console, then click ► (send request). If the results show an `expiry` date, the license is still valid.
- Run `kubectl get pods -A` and confirm that all the pods are less than 30 days old. Also, log in to Kibana via SSO - SSO is paywalled, so it will fail if the license is expired. If the license is expired, follow the steps below to renew it.
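A quick way to surface any pod 30 days old or older (a sketch assuming the default `kubectl get pods` output, where AGE is the last column and multi-day ages render like `45d`):

```shell
# List pods whose AGE column reports 30 days or more.
kubectl get pods -A --no-headers \
  | awk '{ age = $NF } age ~ /^[0-9]+d/ { sub(/d.*/, "", age); if (age + 0 >= 30) print }'
```

No output means every pod is younger than 30 days.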
📌 NOTE: Only run the following if the trial is expired. After running this you will need to re-configure role mapping.
  - R2-D2: ECK Trial Renewal
  Manual Steps: Renew ECK Trial
  kubectl delete hr ek eck-operator -n bigbang
  kubectl delete ns eck-operator logging
  flux reconcile kustomization environment -n bigbang
  # All the following 6 reconciles used to be "suspend and resume", and were changed to forced reconciles to match the [new flux change](https://github.com/fluxcd/flux2/releases/tag/v2.2.0)
  flux reconcile hr bigbang -n bigbang --force
  flux reconcile hr loki -n bigbang --force
  flux reconcile hr promtail -n bigbang --force
  flux reconcile hr fluentbit -n bigbang --force
  flux reconcile hr mattermost -n bigbang --force
  flux reconcile hr jaeger -n bigbang --force
  kubectl delete pods -n mattermost --all
  kubectl delete po -n jaeger -l app.kubernetes.io/component=all-in-one
  📌 NOTE: Suspend/resume and pod cycling for Jaeger and Mattermost used to cycle the mounted Elastic certificates. No more - the forced reconciles above should have the same effect.
  📌 NOTE: Re-configuring role mapping is the final step to accomplish renewing the ECK trial license.
- If you are unable to log into Elasticsearch at all, the attached storage might be too full.
  Example error: `ClusterBlockException: blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete-block]`
  Manual Steps to Delete Indexes
  # Log into a master node
  kubectl exec -it -n logging logging-ek-es-master-0 -- /bin/bash
  # Find a list of large indices
  curl -k -u elastic:PASSWORD "https://localhost:9200/_cat/indices?v=true&pretty"
  # Delete indices until you free up enough space to regain the ability to log in
  curl -k -X DELETE -u elastic:PASSWORD "https://localhost:9200/logstash-2024.04.23"
- Review Mattermost Enterprise trial license status:
  📌 NOTE: Mattermost no longer tells you explicitly that the license is expired. It will only show that it is now on the free version when the enterprise trial has expired.
  - Login to https://chat.dogfood.bigbang.mil as the root user. The credentials can be found in the dogfood cluster's secret with:
    kubectl get secret -n bigbang environment-bb-f86gg49kch -o yaml | yq -e '.data."values.yaml"' | base64 -D | yq .addons.mattermost
    (Note: if the secret does not exist, the name may have changed. Look for the environment-<string> secret in the bigbang namespace.) Alternatively, read the encrypted file in your dogfood repo by running:
    sops -d bigbang/prod2/environment-bb-secret.enc.yaml | yq e '.stringData."values.yaml"' | yq .addons.mattermost
  - Navigate to the System Console -> Edition and License tab. If the license is expired, follow the steps below to renew it.
  📌 NOTE: Only run the following if the trial is expired.
  Renew Mattermost Enterprise Trial
  To "renew" the Mattermost Enterprise trial license, connect to the RDS postgres DB using psql. Follow the guide (the guide will need to be sops-decrypted) to connect to the DB. Contact the Maintainers [^maintainers] if you need additional assistance.
  📌 NOTE: The psql pod may fail to come up.
  Fix the psql pod
  It may be because credentials have expired. If you check the events on that pod, you may see a 4xx error on pulling the images. If you do, you can validate that the credentials are expired by cycling the mattermost deployment. If it fails to come up as well, reach out to the Maintainers [^maintainers] for new credentials. Note you'll need to redo flux after updating the credentials: follow the instructions for updating flux later in this doc, but don't use -s; instead provide the new credentials that the Maintainers [^maintainers] gave you.
  Then run the commands below from within the psql connection, which will cycle the license:
  \c mattermost
  select * from "public"."licenses";
  delete from "public"."licenses";
  \q
  # At this point you can disconnect from the container with ctrl+c if it doesn't automatically quit
  kubectl delete mattermost mattermost -n mattermost
  flux reconcile hr mattermost -n bigbang --force
  Validate that the new Mattermost pod rolls out successfully. If it hasn't reconciled, you may need to suspend/resume bigbang again. Login as a system admin, navigate to the System Console -> Edition and License tab, and click the "Start trial" button. If a prompt asking you for some trial information appears, just enter a fake email address -- it does not matter.
If Flux has been updated in the latest release: -
Run the Flux install script on your locally cloned Big Bang repo as in the example below
$ cd bigbang $ git checkout release-2.<minor>.x $ git pull $ ./scripts/install_flux.sh -s # the `-s` option will reuse the existing secret so you don't have to provide credentials $ cd ../dogfood-bigbang # go back to dogfood after flux upgrade
-
-
- Before upgrading the cluster, do a sanity check on the health of existing cluster resources (pods, helmreleases, etc.). You want to be aware of, and possibly fix, any issues before you upgrade the Big Bang deployment.
  - Check that all Helm Releases show `Ready True`:
    kubectl get hr -A
  - Check that pods are in a `Completed` or `Running` state, with all containers in the pod ready (i.e. `Ready 1/1` or `3/3`, etc.):
    kubectl get pods -A
- Upgrade the release branch on the dogfood cluster `master` by completing the following.
  - R2-D2: Upgrade Dogfood
  Manual Steps
  # Create a new branch.
  $ git checkout -b <branch_name>
  - Upgrade the base kustomization `ref=release-2.<minor>.x` in `bigbang/base2/kustomization.yaml` to the release branch.
  - Upgrade the prod kustomization `branch: "release-2.<minor>.x"` in `bigbang/prod2/kustomization.yaml` to the release branch.
  - Verify the changes above are correct, then:
    # Add & push changes
    $ git add bigbang/base2 bigbang/prod2
    $ git commit -m "upgrade kustomizations to release-2.<minor>.x"
    $ git push -u origin <branch_name>
  - Create a new MR with your kustomization changes pointing to `master`, and get it approved and merged by the Maintainers [^maintainers].
  After pushing the changes and opening an MR, have the Maintainers [^maintainers] merge your branch into `master`. Once merged, `flux` should pick up the changes and begin reconciling the helm releases. This may take up to an hour depending on the size of the release.
- Verify the cluster has updated to the new release:
  - Packages have fetched new git repository revisions and match the new release:
    $ kubectl get gitrepositories -A
  - Packages have reconciled.
  - Watch the release helmreleases, gitrepositories, and kustomizations to check when all HRs have properly reconciled:
    # check release
    watch kubectl get gitrepositories,kustomizations,hr,po -A
  📌 NOTE: Manual steps in the event that Flux does not reconcile for 10 minutes. If flux has not updated after ten minutes:
    flux reconcile hr -n bigbang bigbang --with-source --force
  - If flux is still not updating, delete the flux source controller:
    kubectl delete pod -n flux-system -l app=source-controller
  - If the helm release shows max retries exhausted, describe the HR. If it shows "another operation (install/upgrade/rollback) is in progress", this is an issue caused by too many Flux reconciles happening at once, typically crashing the Helm controller. You will need to delete the helm release secrets and reconcile in flux as follows. Note that ${HR_NAME} is the HR you described that is in a bad state (typically a specific package, NOT bigbang itself).
    # example w/ kiali
    $ HR_NAME=kiali
    $ kubectl get secrets -n bigbang | grep ${HR_NAME}
    # example output:
    # some hr names are duplicated w/ a dash in the middle, some are not
    sh.helm.release.v1.kiali-kiali.v1   helm.sh/release.v1   1   18h
    sh.helm.release.v1.kiali-kiali.v2   helm.sh/release.v1   1   17h
    sh.helm.release.v1.kiali-kiali.v3   helm.sh/release.v1   1   17m
    # Delete the latest one:
    $ kubectl delete secret -n bigbang sh.helm.release.v1.${HR_NAME}-${HR_NAME}.v3
    # suspend/resume the hr
    $ flux suspend hr -n bigbang ${HR_NAME}
    $ flux resume hr -n bigbang ${HR_NAME}
  - If you see "no space left" errors when you `kubectl describe` a failed deployment, the logs have probably filled the node's filesystem. Determine which node the deployment was scheduled on, ssh to it, and delete the logs. You need sshuttle running in order to reach the nodes' IPs.
    ssh -i ~/.ssh/dogfood.pem ec2-user@xx.x.x.xx
    sudo -i
    rm -rf /var/log/containers/*
    rm -rf /var/log/pods/*
  Then the deployment should recover on its own.
- Verify the cluster has updated to the new release:
  - Run `kubectl get pods -A` and verify that all Pods are in "Running" or "Completed" status.
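The pod check can be scripted as a quick one-liner (a sketch; `Completed` is how kubectl renders the Succeeded phase in the STATUS column):

```shell
# Print any pod that is not Running or Completed; otherwise report success.
kubectl get pods -A --no-headers | grep -Ev 'Running|Completed' || echo "all pods healthy"
```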
3. UI Tests
Default Application Credentials
You may utilize the `R2-D2: Run All Tests` option, which will run all of the tests below; however, before you begin using it you'll need your P1 login, as well as a login token for gitlab.dogfood.bigbang.mil (see the gitlab section below for instructions on creating that).
Logging
- Login to kibana with SSO.
  📌 NOTE: If you get "You do not have permission to access the requested page," follow the instructions under "Renew ECK Trial" in the 2. Upgrade and Debug Cluster section above. Don't forget to log in as admin and reconfigure role mapping after you run the kubectl and flux commands.
- Verify that Kibana is actively indexing/logging.
  - To do this, click on the drop-down menu in the upper-left corner, then under "Analytics" click Discover. Click "Create data view." In the "Index pattern" field, enter `jaeger*`. Set the "Timestamp field" to `I don't want to use the time filter`, then click "Use without saving". Log data should populate.
  - If it is not indexing (you see no index for jaeger), you may have renewed the Elasticsearch license and jaeger did not reconcile after the renewal. Force jaeger to pick up the new configuration with:
    flux reconcile hr -n bigbang jaeger --force --with-source
Kiali
- Login to kiali with SSO.
- Validate graphs and traces are visible under applications/workloads:
  - Graphs at the overview
  - Graphs for inbound metrics
  - Graphs for outbound metrics
  - Graphs for traces
- Validate no errors appear.
  ℹ️ NOTE: A red notification bell would be visible if there are errors. Errors on individual application listings for labels, etc. are expected and OK.
Twistlock
- Log in here. In the Targets drop-down, choose serviceMonitor/twistlock/twistlock-console/0.
  - Click the small "show more" button that appears. You should see a State of UP and no errors.
- R2-D2: Twistlock Test
  Manual Steps
  - Login to twistlock/prisma cloud with the credentials in the secret:
    kubectl get secret -n twistlock twistlock-console -o go-template='{{.data.TWISTLOCK_USERNAME | base64decode}}' ; echo ' <- username'
    kubectl get secret -n twistlock twistlock-console -o go-template='{{.data.TWISTLOCK_PASSWORD | base64decode}}' ; echo ' <- password'
  - Under Manage -> Defenders, make sure the number of defenders online is equal to the number of nodes in the cluster. You can list cluster nodes with `kubectl get nodes`.
    Defenders scale with the number of nodes in the cluster. If a defender is offline, check whether its node still exists in the cluster. The cluster autoscaler will often scale nodes up/down, which can result in defenders spinning up and getting torn down. As long as the number of defenders online equals the number of nodes, everything is working as expected.
    📌 NOTE: Pods cannot be allocated to the control-plane nodes, so Twistlock may only show a number of defenders equal to the number of nodes minus the number of control-plane nodes.
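To get the node counts for this comparison, one option (a sketch, assuming control-plane nodes carry the standard `node-role.kubernetes.io/control-plane` label) is:

```shell
# Total node count, then worker-node count (control-plane excluded),
# for comparison against the Twistlock defender count.
kubectl get nodes --no-headers | wc -l
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' | wc -l
```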
Fortify
- R2-D2: Fortify Test
  Fortify Manual Steps
  - Login to fortify as the default admin (find the credentials with):
    sops -d bigbang/prod2/environment-bb-secret.enc.yaml | sed 's/\\n/\'$\'\n/g' | grep "Fortify admin"
  - Validate that the Dashboard page has sections titled `Issues Remediated` and `Issues Pending Review`.
GitLab, Sonarqube, and Gitlab Runner
📌 NOTE: If gitlab has updated to v18 and pipelines are not running, you will need to follow this procedure to register the runner: Runner Registration. Right now this process works, but v18 removes that bypass capability and we will have to follow a new process like the one here.
- Login to gitlab with SSO.
- Edit your profile and change the user avatar. If your avatar doesn't change right away, check again later; it can take several minutes or more.
- Go to Security > Access Token and create a personal access token. Select all scopes. Record the token; you'll need it for the next two steps.
- R2-D2: Gitlab Test
  Gitlab Manual Steps
  - Create a new public group with the release name, i.e. release-2-<minor>-x.
  - Create a new public project (under the group you just made), also with the release name (e.g. release-2.7.x if you're working on release 2.7.0).
  - git clone the project.
  - Pick one of the project folders from Sonarqube Samples and copy all the files into your clone from dogfood.
  - Git commit and push your changes to the repo. When prompted, enter your username and the access token as your password.
- Test a docker image push/pull to/from the registry:
  docker pull alpine
  docker tag alpine registry.dogfood.bigbang.mil/<GROUPNAMEHERE>/<PROJECTNAMEHERE>/alpine:latest
  docker login registry.dogfood.bigbang.mil # Enter your gitlab username and personal access token
  docker push registry.dogfood.bigbang.mil/<GROUPNAMEHERE>/<PROJECTNAMEHERE>/alpine:latest
  docker image rm registry.dogfood.bigbang.mil/<GROUPNAMEHERE>/<PROJECTNAMEHERE>/alpine:latest
  docker pull registry.dogfood.bigbang.mil/<GROUPNAMEHERE>/<PROJECTNAMEHERE>/alpine:latest
Sonarqube Manual Steps
- Login to sonarqube with SSO.
- Add a project for your release. When prompted for how you want to analyze your repository, choose "Locally."
- Generate a token for the project and copy the token somewhere safe for use later.
- When prompted to "Run analysis on your project" choose "Other (for JS, TS, Go, Python, PHP...)". For "What is your OS?" choose Linux. Copy everything that appears under "Running a SonarQube analysis is straightforward. You just need to execute the following commands in your project's folder" and save it somewhere secure for use later.
- After completing the gitlab runner test, return to sonar and check that your project now has analysis.
  ℹ️ NOTE: The project token and project key are different values.
Gitlab Runner Manual Steps
- Log back into gitlab and navigate to your project.
- Under Settings, CI/CD, Variables, add two vars:
  - `SONAR_HOST_URL` set equal to `https://sonarqube.dogfood.bigbang.mil/`
  - `SONAR_TOKEN` set equal to the token you copied from Sonarqube earlier (make this masked)
- Under Settings, CI/CD, deselect "Default to Auto DevOps pipeline" and click `Save changes`.
- Add a `.gitlab-ci.yml` file to the root of the project, paste in the contents of sample_ci.yaml, replacing `-Dsonar.projectKey=XXXXXXX` with what you copied earlier.
- Commit, validate the pipeline runs and succeeds (you may need to retry if there is a connection error), then return to the last step of the sonar test.
Sonarqube
📌 NOTE: If Sonarqube is getting a major version update, you will need to go to http://sonarqube.dogfood.bigbang.mil/setup to upgrade before you will be able to log in.
- Login to sonarqube with SSO.
- Verify your project says PASSED.
Nexus
- R2-D2: Nexus Test
  Nexus Manual Steps
  - Login to Nexus as admin; the password is in the `nexus-repository-manager-secret` secret:
    # username is admin, password is the output of this command
    kubectl get secret -n nexus-repository-manager nexus-repository-manager-secret -o go-template='{{index .data "admin.password" | base64decode}}' ; echo
  - Validate there are no errors displaying in the UI (an "Available CPUs" error about the host system "allocating a maximum of 1 cores to the application" is acceptable).
  - Push/pull an image to/from the nexus registry:
    - With the credentials from the encrypted values (or the admin user credentials), login to the nexus registry:
      docker login containers.dogfood.bigbang.mil
    - Tag and push an image to the registry:
      # ex: <release> = `2-12-0`
      $ docker tag alpine:latest containers.dogfood.bigbang.mil/alpine:<release>
      $ docker push containers.dogfood.bigbang.mil/alpine:<release>
    - Pull down the image from the previous release:
      # ex: <last-release> = `2-12-0`
      $ docker pull containers.dogfood.bigbang.mil/alpine:<last-release>
Minio
- R2-D2: Minio Test
  Minio Manual Steps
  - [ ] Log into the Minio UI - the access and secret key are in the `minio-root-creds-secret` secret.
  - Create a bucket.
  - Store a file in the bucket. To do this after creating the bucket, click on Object Browser, then click the bucket, then Upload.
  - Download the file from the bucket.
  - Delete the bucket and files.
Mattermost
- R2-D2: Mattermost Test
  Mattermost Manual Steps
  - Login to mattermost with SSO.
  - Update/modify your profile picture.
  - Log out and log back in as admin. You can login as the robot admin if you do not have this access (find the credentials in the encrypted values or with):
    sops -d bigbang/prod2/environment-bb-secret.enc.yaml | sed 's/\\n/\'$\'\n/g' | grep "Robot admin"
  - Send chats and validate that chats from previous releases are visible.
  - Under System Console -> Environment, click Elasticsearch, then click Test Connection and Index Now. Validate that both are successful. If Elasticsearch does not appear as an option, go to About > Edition and License on the System Console menu and click the `Start trial` button.
    📌 NOTE: Mattermost may display a warning that the Elasticsearch version is too high, but the test connection should complete successfully regardless.
ArgoCD
- R2-D2: ArgoCD Test
  ArgoCD Manual Steps
  - Login to argocd with SSO.
  - Logout and login with username `admin`. The password is in the `argocd-initial-admin-secret` secret. If that doesn't work, attempt a password reset:
    kubectl -n argocd get secret argocd-initial-admin-secret -o json | jq '.data|to_entries|map({key, value:.value|@base64d})|from_entries'
  - Create an application:
    - Click `[Create Application]` and fill in the below:

      | Setting | Value |
      | --- | --- |
      | Application Name | podinfo |
      | Project | default |
      | Sync Policy | Automatic |
      | Sync Policy | check both boxes |
      | Sync Options | check "auto-create namespace" |
      | Repository URL | https://repo1.dso.mil/big-bang/apps/sandbox/podinfo.git |
      | Revision | HEAD |
      | Path | chart |
      | Cluster URL | https://kubernetes.default.svc |
      | Namespace | podinfo |

    - Click `[Create]` (top of page).
    - Validate the app syncs/becomes healthy.
  - Delete the app.
Kyverno
- R2-D2: Kyverno Test
Kyverno Manual Steps
📌 NOTE: If you are using macOS, make sure you have GNU sed installed and added to your PATH variable: GNU SED Instructions
- Test secret sync in a new namespace:

  ```shell
  # create the secret in the kyverno namespace
  kubectl create secret generic \
    -n kyverno kyverno-bbtest-secret \
    --from-literal=username='username' \
    --from-literal=password='password'

  # create the Kyverno policy
  kubectl apply -f https://repo1.dso.mil/big-bang/product/packages/kyverno/-/raw/main/chart/tests/manifests/sync-secrets.yaml

  # wait until the policy shows as ready before proceeding
  kubectl get clusterpolicy sync-secrets

  # create a namespace with the correct label (essentially we dry-run a namespace
  # creation to get the YAML, add the label, then apply)
  kubectl create namespace kyverno-bbtest --dry-run=client -o yaml | sed '/^metadata:/a\ \ labels: {"kubernetes.io/metadata.name": "kyverno-bbtest"}' | kubectl apply -f -

  # check for the secret that should be synced - if it exists, this test is successful
  kubectl get secrets kyverno-bbtest-secret -n kyverno-bbtest
  ```
- Delete the test resources:

  ```shell
  # if the above was successful, delete the test resources
  kubectl delete -f https://repo1.dso.mil/big-bang/product/packages/kyverno/-/raw/main/chart/tests/manifests/sync-secrets.yaml
  kubectl delete secret kyverno-bbtest-secret -n kyverno
  kubectl delete ns kyverno-bbtest
  ```
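The sed expression in the namespace step can be exercised locally, without kubectl, by feeding it a dry-run-style manifest. A minimal sketch (the manifest text is an illustrative stand-in for the real dry-run output; requires GNU sed, per the macOS note above):

```shell
# Minimal stand-in for `kubectl create namespace ... --dry-run=client -o yaml`
manifest='apiVersion: v1
kind: Namespace
metadata:
  name: kyverno-bbtest'

# Same sed expression as the release step: append a labels line after `metadata:`
printf '%s\n' "$manifest" \
  | sed '/^metadata:/a\ \ labels: {"kubernetes.io/metadata.name": "kyverno-bbtest"}'
```

The output should now contain a labels line under metadata:, which is what makes the Kyverno sync policy pick up the new namespace.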
Velero
- R2-D2: Velero Test
📌 NOTE: This test fails occasionally; you may need to run it multiple times before it passes.
Velero Manual Steps
- Login to https://nexus.dogfood.bigbang.mil/#browse/welcome as admin. Confirm the following so that the test deployment can pull the alpine image from Nexus without credentials:
  - Verify that the velero-tests.dogfood.bigbang.mil/alpine:test repository exists on Nexus.
  - Verify that anonymous docker pull is enabled for the velero-tests repository: Settings -> Repositories -> velero-tests -> check "Allow anonymous docker pull" (Docker Bearer Token Realm required) -> Save.
  - Go to Settings -> Security -> Realms and verify that Docker Bearer Token is in the Active column. If it is not, click it to move it there, then click Save.
  - Verify that Anonymous Access is enabled: Settings -> Security -> Anonymous Access -> check "Allow anonymous users to access the server" -> Save.
- Install the velero CLI on your workstation if you don't already have it (on macOS, run brew install velero).
- Backup PVCs using velero_test.yaml:

  ```shell
  $ kubectl apply -f ./docs/release/velero_test.yaml
  # wait 30s for velero to be ready, then
  # exec into the velero_test container and check the log
  $ veleropod=$(kubectl get pod -n velero-test -o json | jq -r '.items[].metadata.name')
  $ kubectl exec $veleropod -n velero-test -- tail /mnt/velero-test/test.log
  ```
- Set VERSION to the release you are testing and use the CLI to create a test backup:

  ```shell
  $ VERSION=2-<minor>-0
  $ velero backup create velero-test-backup-${VERSION} -l app=velero-test
  $ velero backup get
  ```
- Wait a bit and re-run velero backup get. When the backup shows "Completed", delete the app:

  ```shell
  $ kubectl delete -f ./docs/release/velero_test.yaml
  # namespace "velero-test" deleted
  # persistentvolumeclaim "velero-test" deleted
  # deployment.apps "velero-test" deleted
  ```
- Restore the test resources from the backup:

  ```shell
  $ velero restore create velero-test-restore-${VERSION} --from-backup velero-test-backup-${VERSION}
  # exec into the velero_test container
  $ kubectl exec $veleropod -n velero-test -- cat /mnt/velero-test/test.log
  ```
- Confirm that both the old and the new log entries appear in the logs; this confirms the backup was done correctly. Example output of the container logs:

  ```
  Running command: kubectl exec velero-test-6549b5768d-872jc -n velero-test -- tail /mnt/velero-test/test.log
  Command output:
  Fri Jul 21 13:47:00 UTC 2023
  Fri Jul 21 13:47:10 UTC 2023

  Running command: kubectl exec velero-test-6549b5768d-872jc -n velero-test -- cat /mnt/velero-test/test.log
  Command output:
  Fri Jul 21 13:47:00 UTC 2023
  Fri Jul 21 13:47:10 UTC 2023
  Fri Jul 21 13:47:20 UTC 2023
  Fri Jul 21 13:47:30 UTC 2023
  Fri Jul 21 13:50:45 UTC 2023
  Fri Jul 21 13:50:55 UTC 2023
  ```
- Cleanup: delete the test resources:

  ```shell
  $ kubectl delete -f ./docs/release/velero_test.yaml
  # namespace "velero-test" deleted
  # persistentvolumeclaim "velero-test" deleted
  # deployment.apps "velero-test" deleted
  ```
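The "old and new entries" check above can be made mechanical. A small local sketch (file names and timestamps are illustrative stand-ins for the tail/cat output captured before the backup and after the restore): verify that every pre-backup line survived the restore.

```shell
# Stand-ins for the log captured before backup and after restore.
printf '%s\n' 'Fri Jul 21 13:47:00 UTC 2023' 'Fri Jul 21 13:47:10 UTC 2023' > before.log
printf '%s\n' 'Fri Jul 21 13:47:00 UTC 2023' 'Fri Jul 21 13:47:10 UTC 2023' \
              'Fri Jul 21 13:50:45 UTC 2023' > after.log

# grep -vxFf: list lines of before.log that have no exact match in after.log.
# No matches (exit status 1) means every old entry survived the restore.
if grep -vxqFf after.log before.log; then
  echo "MISSING entries - backup/restore failed"
else
  echo "backup OK - all old entries present"
fi
```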
Loki
- R2-D2: Loki Test
Loki Manual Steps
- Login to grafana as admin (username "admin"). Retrieve the password with:

  ```shell
  sops -d bigbang/prod2/environment-bb-secret.enc.yaml | sed 's/\\n/\'$'\n/g' | grep grafana -A 2
  ```
- Click the drop-down in the upper-left corner, choose Data Sources, then choose Loki. Scroll down and click Save & Test. A message should appear that reads "Data source successfully connected."
- Click the drop-down in the upper-left, choose Dashboards, then Loki Dashboard quick search. Verify that this dashboard shows data/logs from the past few minutes and hours.
- Click the drop-down in the upper-left, choose Dashboards, then Loki / Operational. Verify that both the Distributor Success Rate and Ingester Success Rate are 100%, and that data exists when you expand the Chunks item (toward the bottom of the page).
  📌 NOTE: this dashboard is currently missing; we have a task here to add it back in.
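The sed fragment in the sops pipeline above exists to turn literal `\n` sequences (as stored in the decrypted YAML) into real newlines so that grep -A can match line-by-line. A local sketch with a stand-in string in place of the sops output:

```shell
# Stand-in for a sops-decrypted value that contains literal "\n" sequences.
sample='grafana-admin:\npassword123'

# Same sed trick as above: replace each literal \n with a real newline.
# The $'\n' is bash ANSI-C quoting, spliced into the middle of the sed script.
printf '%s\n' "$sample" | sed 's/\\n/\'$'\n/g' | grep -A 1 grafana-admin
# -> two lines: "grafana-admin:" then "password123"
```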
Tempo
- R2-D2: Tempo Test
Tempo Manual Steps
- Login to grafana as admin (username "admin"). Retrieve the password with:

  ```shell
  sops -d bigbang/prod2/environment-bb-secret.enc.yaml | sed 's/\\n/\'$'\n/g' | grep grafana -A 2
  ```
- Click the drop-down menu in the upper-left corner, choose Data Sources, then Tempo. Scroll down and click Save & Test. A message should appear that reads "Data source successfully connected."
- Visit tempo tracing and ensure services are populating under the Service drop-down. For example, you might see jaeger-query and tempo-grpc-plugin listed as options.
  📌 NOTE: this test is currently obsolete because we do not have any data tethered to Tempo at the moment.
Keycloak
- R2-D2: Keycloak Test
Keycloak Manual Steps
- Login to the Keycloak admin console. The credentials are in the keycloak-env secret:

  ```shell
  kubectl get secret keycloak-env -n keycloak -o jsonpath="{.data.KEYCLOAK_ADMIN}" | base64 -d ; echo " <- admin user"
  kubectl get secret keycloak-env -n keycloak -o jsonpath="{.data.KEYCLOAK_ADMIN_PASSWORD}" | base64 -d ; echo " <- password"
  ```
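The decode-and-label pattern above (base64 -d plus a trailing echo tag) can be checked locally, with a stand-in value in place of the live kubectl jsonpath output:

```shell
# Stand-in for `kubectl get secret ... -o jsonpath="{.data.KEYCLOAK_ADMIN}"`
encoded=$(printf '%s' 'admin' | base64)

# Same shape as the release step: decode, then tag the output for readability.
printf '%s' "$encoded" | base64 -d ; echo " <- admin user"
# -> admin <- admin user
```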
Neuvector
- R2-D2: Neuvector Test
Neuvector Manual Steps
- Login to Neuvector with the default login (currently admin:admin; update this to something secure). If the admin password is unknown, reset it to the default.
- Navigate to Assets -> System Components and validate that all components show as healthy.
- Under Assets -> Containers, click any image and run a scan. When the scan finishes, click on the container; the results appear in the Compliance and Vulnerabilities tabs below.
- Under Network Activity, validate that the graph loads and shows pods and traffic. This graph can take several minutes or more to load; you may want to leave the tab open and move on to the next UI test while it populates.
Grafana & ArgoCD & Anchore & Mattermost SSO
- R2-D2: Native SSO Grafana, ArgoCD, Anchore and Mattermost Test
📌 NOTE: The Grafana OPA Violations dashboard showing "No Data" is a known issue. It has been skipped since 2.18.0; the current open issue awaiting a fix is here.
SSO Manual Steps
- Login to Grafana with SSO
- Login to ArgoCD with SSO
- Login to Anchore with SSO
- Login to Mattermost
Monitoring & Cluster Auditor & Tracing (Jaeger) & Alertmanager
- R2-D2: authservice SSO Prometheus, Tracing, AlertManager, and Cluster Auditor Test
SSO2 Manual Steps
- Load alertmanager, login with SSO, and validate that (at minimum) the Watchdog alert is firing.
- Load tracing, login with SSO, and ensure there are no errors on the main page and that traces can be found for apps.
- Login to Grafana with SSO.
  📌 NOTE: The Grafana OPA Violations dashboard showing "No Data" is a known issue. It has been skipped since 2.18.0; the current open issue awaiting a fix is here.
  - Verify the OPA Violations dashboard is present and shows violations in namespaces.
  - Click the drop-down menu in the upper-left and choose Dashboards. Verify that several Kubernetes and Istio dashboards are listed. Click on a few of these and verify that metrics are populating in them.
- Login to prometheus.
  - Go to Status -> Targets. Validate that no unexpected services are marked as Unhealthy. Known unhealthy targets include:

    ```
    serviceMonitor/monitoring/monitoring-monitoring-kube-kube-controller-manager/0
    serviceMonitor/monitoring/monitoring-monitoring-kube-kube-etcd/0
    serviceMonitor/monitoring/monitoring-monitoring-kube-kube-scheduler/0
    ```
📌 NOTE: If vault shows up as UNHEALTHY, that means it was rebuilt and we need to update the permissions on the /vault/secrets/token file within the prometheus pod (we have an issue to fix this). Exec into the vault-agent container on the prometheus-monitoring-monitoring-kube-prometheus-0 pod and run:

  ```shell
  chmod 644 /vault/secrets/token
  ```
Anchore
- Log back in as the admin user; the password is in the anchore-anchore-enterprise secret (admin will have pull credentials set up for the registries):

  ```shell
  kubectl get secret anchore-anchore-enterprise -n anchore -o json | jq -r '.data.ANCHORE_ADMIN_PASSWORD' | base64 -d; echo ' <- password'
  ```
- Scan an image in the dogfood registry, registry.dogfood.bigbang.mil/GROUPNAMEHERE/PROJECTNAMEHERE/alpine:latest
  - Go to the Images page, then Analyze Tag:
    - Registry: registry.dogfood.bigbang.mil
    - Repository: GROUPNAME/PROJECTNAME/alpine
    - Tag: latest
- Scan an image in the nexus registry, containers.dogfood.bigbang.mil/alpine:<release-number> (use your release number, e.g. 2-<minor>-0). If authentication fails, the Nexus credentials have changed. Retrieve the Nexus credentials using the Nexus instructions above, then update the credentials in Anchore: click menu Images -> Analyze Repository -> the "clicking here" link at the bottom -> click the registry name -> edit the credentials.
- Validate that scans complete and Anchore displays data (click the SHA value for each image).
4. Create Release
🛑 STOP: Before finalizing the release, confirm that the latest bb-docs-compiler pipeline passed. If not, identify, correct, and cherry-pick a fix into the release branch. This will prevent wasting time re-running pipelines if the docs compiler pipeline fails.
- In the dogfood repo, create an MR with the release notes created in the step below, and ask the Maintainers 1 to merge it before proceeding.
- Build the release notes:
  - R2-D2: run with the Build release notes option selected
  - Commit the resulting release notes and push to <release-branch> in Big Bang (dogfood)

  Manual Steps

  ```shell
  # clone the dogfood cluster repo
  $ git clone https://repo1.dso.mil/big-bang/team/deployments/bigbang/ dogfood-bigbang
  # cd into the dogfood-bigbang repo's release notes dir
  $ cd dogfood-bigbang/docs/release
  # run r2d2 and select the `Build release notes` option
  $ r2d2
  # commit the release notes
  $ cd ../../
  $ git add .
  $ git commit -m "add initial release notes"
  $ git push -u origin <branch_name>
  ```
- Clean up the release notes:
  - Check whether any packages have been added or have graduated from BETA. There is no single resource that lists this information, but you should generally be aware of such changes from discussions with other team members; you can also ask the Maintainers 1 if unsure. If any packages have been added or moved out of BETA, adjust your local R2-D2 config as needed, and open an MR for R2-D2 with the change to the default config.
  - Create Upgrade Notices based on the notices listed in the release issue. Also review with the maintainers to see whether any other major changes should be noted.
  - Move MRs from 'Big Bang MRs' into their specific sections below. If an MR is for Big Bang itself (docs, CI, etc.) and not a specific package, it can remain in place under 'Big Bang MRs'. General rule of thumb: if the git tag changes for a package, put the MR under that package; if there is no git tag change for a package, put it under BB.
  - Adjust known issues as needed: if an issue has been resolved it can be removed, and any newly discovered issues should be added.
  - Verify the table contains all chart upgrades mentioned in the MR list below.
  - Verify the table contains all application versions; compare with the previous release notes. If a package was upgraded, there will usually be a bump in both the application version and the chart version in the table.
  - Verify all internal comments are removed from the release notes, such as comments directing the release engineer to copy/move things around.
  - Any issues encountered during release upgrade testing must be thoroughly documented in the release notes. If you had to delete a replica set, force-restart some pods, or delete a PV, make sure those steps are included in the known issues section of the release notes.
- Finalize the tag in chart/Chart.yaml (remove -rc.x if present), then commit and push this change.
- If there were any cherry-picks during Step 3, re-run the R2-D2 Build release notes job in the dogfood repo.
- Re-run helm-docs to update to the latest version and latest package versions:

  ```shell
  cd bigbang
  git pull   # pull any last-minute cherry-picks; verify nothing has greatly changed
  git checkout release-2.<minor>.x
  docker run -v "$(pwd):/helm-docs" -u $(id -u) jnorwood/helm-docs:v1.5.0 -s file -t .gitlab/base_config.md.gotmpl --dry-run > ./docs/understanding-bigbang/configuration/base-config.md
  # commit and push the changes (if any)
  ```
- Create a release candidate tag based on the release branch, i.e. 2.<minor>.0-rc.0. Tagging additionally creates the release artifacts, and the pipeline runs for over an hour. You will need to request tag permissions from the Maintainers 1.
  - To do this via the UI (generally preferred): tags page -> new tag; name: 2.<minor>.0-rc.0, create from: release-2.<minor>.x, message: "release candidate", release notes: leave empty.
  - To do this via the git CLI:

    ```shell
    git tag -a 2.<minor>.0-rc.0 -m "release candidate"
    # list the tags to make sure you made the correct one
    git tag -n
    git push origin 2.<minor>.0-rc.0
    ```
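Before pushing, the tag name can be sanity-checked against the expected naming scheme. A hypothetical helper (not part of the official process; the pattern assumes the 2.<minor>.0-rc.N naming shown above):

```shell
# Hypothetical guard: verify a tag matches the release-candidate naming scheme
# (2.<minor>.0-rc.<n>) before pushing it to origin.
tag="2.43.0-rc.0"
if printf '%s\n' "$tag" | grep -Eq '^2\.[0-9]+\.0-rc\.[0-9]+$'; then
  echo "tag looks valid: $tag"
else
  echo "unexpected tag name: $tag" >&2
fi
```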
- Verify the pipeline for the release candidate tag passed.
- Verify the bb-docs-compiler pipeline passed for the latest RC tag (it gets created/scheduled toward the end of the bigbang release pipeline). Review all pipeline output, looking for failures or warnings. Reach out to the Maintainers 1 for a quick review before proceeding.
- Check the releases page for the release notes.
  📌 NOTE: There is currently an error with automatically adding the release notes to the release, so you will have to add them manually.
- Create the release tag based on the release branch, i.e. 2.<minor>.0.
  - To do this via the UI (generally preferred): tags page -> new tag; name: 2.<minor>.0, create from: release-2.<minor>.x, message: release 2.<minor>.0, release notes: leave empty.
  - To do this via the git CLI:

    ```shell
    git tag -a 2.<minor>.0 -m "release 2.<minor>.0"
    # list the tags to make sure you made the correct one
    git tag -n
    # push
    git push origin 2.<minor>.0
    ```
- Verify the release pipeline passed. Review all pipeline output, looking for failures or warnings. Reach out to the Maintainers 1 for a quick review before proceeding.
  - Ensure that the RC tag release created prior to this was deleted as well.
- Ensure the release notes are present here.
- Ensure that the release docs have compiled and appear on Big Bang Docs.
  - Expand the drop-down in the title bar showing latest and confirm the current release is present.
  - Click around in the docs and make sure they are properly formatted; pay special attention to lists and tables.
  📌 NOTE: BB-docs won't actually publish the latest release version until the nightly bb-docs job runs; you can check this on the following day.
- Type up a release announcement for the Maintainers 1 to post in the Big Bang Value Stream channel (https://chat.il2.dso.mil/platform-one/channels/team---big-bang). Mention any breaking changes and some of the key updates from this release, using the same format as a prior announcement.
- Cherry-pick release commit(s) as needed with a merge request back to the master branch. We never merge the release branch back to master; instead, make a separate branch from master and cherry-pick the release commit(s) into it. Use the resulting MR to close the release issue.

  ```shell
  # Navigate to your local clone of the BB repo
  cd path/to/my/clone/of/bigbang
  # Get the most up-to-date master
  git checkout master
  git pull
  # Check out a new branch for cherry-picking
  git checkout -b 2.<minor>-cherrypick
  # Repeat the following for whichever commits are required from the release.
  # Typically this is the initial release commit (bumped GitRepo, Chart.yaml,
  # CHANGELOG, README) and a final commit that re-ran helm-docs (if needed).
  git cherry-pick <commit sha for Nth commit>
  git push -u origin 2.<minor>-cherrypick
  # Create an MR in GitLab merging this branch back to master; reach out to the maintainers to review
  ```
- Close the Big Bang milestone in GitLab. (Make sure the expected milestones still exist — the current one plus one future milestone; if not, contact the Maintainers 1.)
- Hand off the release to the Maintainers 1; they will review, then celebrate and announce the release in the public MM channel.
- Reach out to @marknelson to let him know that the release is done and to update the GitLab banner.