test: add reboot survival test script (REBOOT-01)

Creates scripts/test-reboot-survival.sh with TAP format output.
Records pre-reboot containers, reboots node, waits for SSH + health,
verifies container count/state/health. 6 checks per iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Dorian
2026-03-14 02:52:55 +00:00
parent 6335ea17ee
commit f8fdf05ff6
2 changed files with 217 additions and 1 deletions

View File

@@ -229,7 +229,7 @@ Every test must pass **10 consecutive times** from BOTH .228→.198 AND .198→.
### Sprint 7: Zero-Downtime Reboot Testing
- [ ] **REBOOT-01** — Create reboot survival test script. `scripts/test-reboot-survival.sh` that: (1) Records all container names and states, (2) Reboots the node via `sudo reboot`, (3) Waits for SSH to come back (poll every 10s, max 180s), (4) Verifies ALL containers are running, (5) Verifies health endpoint returns OK, (6) Verifies no containers have restart counts > 0 since boot. Run on .228. **Acceptance**: Script passes. All containers survive reboot.
- [x] **REBOOT-01** — Created `scripts/test-reboot-survival.sh`. TAP-format output with `--node`, `--iterations`, `--rest-between` flags. Records pre-reboot containers, reboots via sudo, waits for SSH (180s max) + health (120s max) + container stabilization (120s), verifies: container count recovered, no exited, all pre-reboot containers back, health OK, no restart loops. 6 checks per iteration.
- [ ] **REBOOT-02** — Run reboot survival test 10 times on .228. Execute test-reboot-survival.sh 10 times with 5-minute rest between reboots. Track: time to full recovery, any containers that fail to start, any services that don't come back. **Acceptance**: 10/10 reboots recover fully within 120s. Zero failed containers.