GitLab Registry pod failing to start after BB version 2.46.0 release
Bug: GitLab Registry Pod Failure
BigBang Version 2.46.0
Description
We performed the BigBang version 2.46.0 upgrade, and the GitLab Registry pod error only occurs when a registry pod rotates for any reason. Because of this, the errors were not initially obvious and the upgrade went smoothly until a few of the registry pods rotated in different deployments (Dev/Stage/Prod/Ops). The issue doesn't show until a GitLab Registry pod restarts or is manually deleted; if only one pod is deleted, any remaining pods are still functional. I have investigated the AWS side to verify there aren't problems there causing this issue, and I can't find a root cause in the AWS permissions or other AWS resources.
It is important to note that our Dev deployment was redeployed with a fresh EKS cluster around a month ago, so no old resources should be present; additionally, this issue only came up on Feb 13th 2025.
When the GitLab Registry pod comes back up after a restart, it fails with the following error in the registry pod logs.
"error":"WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError
creating new registry instance: configuring application: s3aws: WebIdentityErr: failed to retrieve credentials caused by: SerializationError: failed to unmarshal error message
status code: 405, request id: caused by: UnmarshalError: failed to unmarshal error message
s3aws: WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405
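For context, this is roughly how the failure surfaces, assuming the registry pods run in the gitlab namespace with the chart's default app=registry label and a gitlab-registry deployment (names may differ in your environment):
# List the running registry pods, then delete one to trigger a replacement
kubectl -n gitlab get pods -l app=registry
kubectl -n gitlab delete pod <one-of-the-registry-pods>
# The replacement pod crash-loops with the WebIdentityErr/SerializationError shown above
kubectl -n gitlab logs deployment/gitlab-registry --tail=50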
I investigated the AWS permissions and temporarily opened them wide open to rule out a problem with the service account's access to the S3 bucket, and that made no difference.
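For reference, these are the kinds of checks that investigation involved; the service account, role, and bucket names below are placeholders for our setup and will differ in yours:
# Confirm the registry service account carries the expected IRSA role annotation
kubectl -n gitlab get serviceaccount gitlab-registry -o yaml | grep role-arn
# Confirm the role's trust policy and attached S3 permissions on the AWS side
aws iam get-role --role-name <registry-irsa-role> --query 'Role.AssumeRolePolicyDocument'
aws iam list-attached-role-policies --role-name <registry-irsa-role>
# Sanity-check that the bucket itself is reachable
aws s3 ls s3://<gitlab-registry-bucket>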
I investigated the issues mentioned in the 2.46.0 release notes, and those don't appear to be the culprit either. The 3 secrets mentioned below were already present, and we don't manually create the rails secrets in our deployment (a quick verification sketch follows the list):
BigBang 2.46.0 release debug steps:
- If you have a manually created gitlab-rails secret, your upgrade may fail with:
  Errno::EBUSY: Device or resource busy @ rb_file_s_rename - (/srv/gitlab/config/secrets.yml, /srv/gitlab/config/secrets.yml.orig.1738013281)
- In order to resolve this you will likely need to manually generate the other 3 secrets described here.
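For completeness, this is a quick way to confirm those secrets already exist; gitlab-rails-secret is assumed to be the chart default name for a release named gitlab and may differ in your values:
# List rails-related secrets in the gitlab namespace
kubectl -n gitlab get secrets | grep -i rails
# Show the keys (not the values) of the main rails secret
kubectl -n gitlab describe secret gitlab-rails-secret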
If you see the "Error scraping target" error in Prometheus for the gitlab-exporter, as mentioned in the upgrade notices, please read the following:
Steps to Resolve
- Verify Service Monitor Configuration: Use the following command to check if the fallbackScrapeProtocol line is present:
  kubectl -n gitlab get servicemonitor gitlab-gitlab-exporter -o yaml
  If fallbackScrapeProtocol: PrometheusText1.0.0 is missing, proceed with the next steps.
- Update Service Monitor: First, export the current service monitor configuration:
  kubectl -n gitlab get servicemonitor gitlab-gitlab-exporter -o yaml > servicemonitor_gitlab_exporter.yaml
  Then, delete the existing service monitor:
  kubectl -n gitlab delete servicemonitor gitlab-gitlab-exporter
- Redeploy or Update the Helm Release: Redeploy BigBang or force a redeployment of the Helm release. This should ensure that fallbackScrapeProtocol: PrometheusText1.0.0 is included, resolving the Prometheus scraping error (a combined sketch of these steps follows this list).
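Putting those steps together, this is the rough sequence, assuming BigBang deploys GitLab via Flux with a HelmRelease named gitlab in the bigbang namespace (adjust names and namespaces to your deployment):
# Check whether the fallback scrape protocol is already present
kubectl -n gitlab get servicemonitor gitlab-gitlab-exporter -o yaml | grep fallbackScrapeProtocol
# If missing, back up and remove the stale service monitor
kubectl -n gitlab get servicemonitor gitlab-gitlab-exporter -o yaml > servicemonitor_gitlab_exporter.yaml
kubectl -n gitlab delete servicemonitor gitlab-gitlab-exporter
# Force Flux to reconcile the GitLab HelmRelease so the service monitor is recreated with the new field
flux reconcile helmrelease gitlab -n bigbang --with-source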