release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet

Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.

What changed:
- apply_update() writes a pending-verify marker with old+new version and
  a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
  present and within its freshness window, the new binary waits 15s for
  nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
  up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
  nothing else happens.
- On window-exhaust, the new binary:
    1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
       (quarantined, not deleted, so we can post-mortem).
    2. Restores web-ui.bak on top of web-ui.
    3. Calls rollback_update() to restore the previous binary.
    4. Updates state.current_version to reflect the rollback.
    5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
  probing, so a crashed-during-startup marker from weeks ago cannot
  spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
  tokio::fs::copy, so it escapes the service's ProtectSystem=strict
  mount namespace. Without this, the rollback silently failed with
  EROFS on /usr/local/bin and orphaned the rollback - the exact
  opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.

Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
This commit is contained in:
archipelago
2026-04-22 16:14:35 -04:00
parent 50744952b7
commit 048679065e
11 changed files with 645 additions and 24 deletions

View File

@@ -108,7 +108,10 @@ if [ -z "$FRONTEND_ARCHIVE" ]; then
# Verify the archive root entry is world-readable before we
# declare success — catches regressions in tar-flag handling
# (BSD tar, busybox tar) that might silently drop --mode.
root_mode=$(tar tvzf "$FRONTEND_ARCHIVE" | head -1 | awk '{print $1}')
# SIGPIPE-safe: use awk to read only the first line and exit,
# then terminate the tar pipeline explicitly so `pipefail`+SIGPIPE
# don't kill the whole `set -euo pipefail` script.
root_mode=$(tar tvzf "$FRONTEND_ARCHIVE" 2>/dev/null | awk 'NR==1{print $1; exit}')
case "$root_mode" in
drwxr-xr-x|drwxr-x*x*)
echo " Tarball root perms OK: $root_mode"