Package pipelines depending on CI RDS sometimes fail
Bug
Description
Intermittently we will see issues like this in package pipelines depending upon CI RDS instances
https://repo1.dso.mil/big-bang/bigbang/-/jobs/41249655#L882
psycopg2.errors.InternalError_: tuple concurrently updated
This appears to be in the fifth postgresql call inside of the rds_create tree. This is likely happening on the GRANT USAGE
call, just by looking at the logs:
masterquery "CREATE DATABASE ${newdbname};" || return 1
masterquery "CREATE USER ${newusername} WITH ENCRYPTED PASSWORD '${newpassword}';" || return 1
masterquery "GRANT ALL PRIVILEGES ON DATABASE ${newdbname} TO ${newusername};" || return 1
masterquery "ALTER DATABASE ${newdbname} OWNER TO ${newusername};" || return 1
masterquery "GRANT USAGE, CREATE ON SCHEMA PUBLIC TO ${newusername};" || return 1
From what I can see in the postgres docs, this tuple updated concurrently
error is usually due to either:
- Data corruption in the postgres instance itself (highly unlikely)
- multiple threads attempting to update the same object at the same time without adequate transaction isolation (also unlikely, but not impossible in our case)
- improperly closed or dangling transactions impeding future operations (easier to check for, somewhat more likely)
My recommendation is:
- First, add the official postgresql psql client binary to our CI images and get rid of the python-psycopg2 hack that was put in place to work around the missing binaries.
- Second, ensure all the transactions we're using to execute these statements are fully executed and closed before going to the next one
- Third, see if we can wrap all the 5 statements related to creating the user, role, and db into a single transaction
- Fourth, see if we can do a better job of randomizing the database and user names. Right now it is unlikely but theoretically possible to have two pipelines running at the same time asking for an RDS DB and colliding on the user or database name. The username is calculated based off of the PRNG and the database name is a combination of the username and a timestamp. It is not impossible for two pipeline containers to start at the exact same time, initialize their PRNG with the exact same seed, and therefore generate the exact same username/db combo. I think this is the LEAST likely culprit, but if the first three don't pan out, it's where I would go next.
BigBang Version
2.4x series