release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet

Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.

What changed:
- apply_update() writes a pending-verify marker with old+new version and
  a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
  present and within its freshness window, the new binary waits 15s for
  nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
  up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
  nothing else happens.
- On window-exhaust, the new binary:
    1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
       (quarantined, not deleted, so we can post-mortem).
    2. Restores web-ui.bak on top of web-ui.
    3. Calls rollback_update() to restore the previous binary.
    4. Updates state.current_version to reflect the rollback.
    5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
  probing, so a crashed-during-startup marker from weeks ago cannot
  spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
  tokio::fs::copy, so it escapes the service's ProtectSystem=strict
  mount namespace. Without this, the rollback silently failed with
  EROFS on /usr/local/bin and orphaned the rollback - the exact
  opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.

Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
This commit is contained in:
archipelago
2026-04-22 16:14:35 -04:00
parent 50744952b7
commit 048679065e
11 changed files with 645 additions and 24 deletions

View File

@@ -1,7 +1,7 @@
{
"name": "neode-ui",
"private": true,
"version": "1.7.40-alpha",
"version": "1.7.41-alpha",
"type": "module",
"scripts": {
"start": "./start-dev.sh",

View File

@@ -180,6 +180,16 @@ init()
</button>
</div>
<div class="overflow-y-auto flex-1 min-h-0 space-y-6 pr-1">
<!-- v1.7.41-alpha -->
<div>
<div class="flex items-center gap-2 mb-3">
<span class="text-xs font-mono px-2 py-0.5 rounded bg-orange-500/20 text-orange-300">v1.7.41-alpha</span>
<span class="text-xs text-white/40">Apr 22, 2026</span>
</div>
<div class="space-y-3 text-sm text-white/80 pl-3 border-l border-white/10">
<p>Updates now self-check. After an update lands, the node probes its own web UI through nginx if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.</p>
</div>
</div>
<!-- v1.7.40-alpha -->
<div>
<div class="flex items-center gap-2 mb-3">