Modify wait script to better handle `ArtifactFailed`
See: https://repo1.dso.mil/platform-one/big-bang/bigbang/-/jobs/6345892
Currently our wait script checks for failed HRs every 5 seconds and exits as soon as it finds one (see the code here).
One problem with this is that Flux sometimes puts HRs in a "failed" state because of Flux timing or gitrepos not being fully ready. We tried to avoid this by adding a wait for all gitrepos to be ready, but Flux sometimes still reports the `ArtifactFailed` status even when the gitrepo is ready.
This task should involve:
- Investigate what the `ArtifactFailed` status on a Flux HR means
  - Evaluate whether there are situations where we should consider this a real failure
  - If it is something we should retry/not consider a failure, evaluate options to work around this in the pipeline (retry/ignore, different gitrepo wait, etc.)
- Implement the chosen solution
AC:
- Modified wait script that takes the `ArtifactFailed` status into account and handles it in a "smarter" way
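
One of the options listed in the task (retry instead of failing right away) could look roughly like the sketch below. It is written in Python only to show the decision logic; the real wait script would need an equivalent in its own language. Everything here is an assumption for illustration: the function names `classify_failure` and `mark_recovered`, the `first_seen` bookkeeping, and the 120 second grace period are hypothetical, and the investigation may conclude that `ArtifactFailed` should be handled differently.

```python
# Reasons treated as possibly transient. Assumption: the issue only names
# ArtifactFailed; the investigation may add or remove entries.
TRANSIENT_REASONS = {"ArtifactFailed"}

# Hypothetical grace period in seconds; the right value needs tuning.
GRACE_SECONDS = 120


def classify_failure(key, reason, first_seen, now, grace=GRACE_SECONDS):
    """Decide what to do with an HR that the existing check flagged as failed.

    key        -- identifier for the HR, e.g. "namespace/name"
    reason     -- the failure reason reported on the HR
    first_seen -- dict mapping key -> time the transient failure was first seen
    now        -- current time in seconds (any monotonic clock)

    Returns "retry" while a transient failure is inside the grace period,
    otherwise "failed".
    """
    if reason not in TRANSIENT_REASONS:
        # Any other reason keeps today's behaviour: fail immediately.
        return "failed"
    # Record the first time this HR showed a transient failure.
    started = first_seen.setdefault(key, now)
    if now - started < grace:
        return "retry"
    # Still failing after the grace period: treat it as a real failure.
    return "failed"


def mark_recovered(key, first_seen):
    """Forget a previous transient failure once the HR is no longer failed."""
    first_seen.pop(key, None)
```

In the polling loop, the script would call `classify_failure` for each HR it currently counts as failed and exit only on `"failed"`, and call `mark_recovered` for HRs that are no longer failed so a later `ArtifactFailed` gets a fresh grace period.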
Edited by evan.rush