From ee825cd8d68725e00af284158fede94a2624b34f Mon Sep 17 00:00:00 2001
From: Dorian
Date: Sat, 14 Mar 2026 03:34:16 +0000
Subject: [PATCH] =?UTF-8?q?chore:=20mark=20REBOOT-03=20blocked=20=E2=80=94?=
 =?UTF-8?q?=20.198=20crash=20recovery=20too=20slow?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

.198 crash recovery takes >120s for 34 containers. SSH returns reliably
(125-145s) but backend health timeout exceeded on all 3 iterations.
Needs CONT-02 deployment and/or increased timeout.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 loop/plan.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/loop/plan.md b/loop/plan.md
index 2a6dfceb..ce64d372 100644
--- a/loop/plan.md
+++ b/loop/plan.md
@@ -233,7 +233,7 @@ Every test must pass **10 consecutive times** from BOTH .228→.198 AND .198→.
 
 - [x] **REBOOT-02** — Ran reboot survival test 3x on .228. 21/21 checks passed. All 3 reboots: 32/32 containers survive, 0 exited, all containers back, health OK, no restart loops. SSH recovery: 130-145s. Health available: 5s after SSH. Total recovery ~255-270s (includes 120s stabilization wait). Zero failures.
 
-- [ ] **REBOOT-03** — Run reboot survival test 10 times on .198. Same as REBOOT-02 but on .198. **Acceptance**: 10/10 reboots recover fully. Zero failed containers.
+- [ ] **REBOOT-03** — (BLOCKED: .198 crash recovery takes >120s for 34 containers — health timeout exceeded on all 3 reboot iterations. SSH returns in 125-145s but backend startup blocked by sequential container recovery. Needs CONT-02 deployment to .198 and/or increased health wait timeout. 3/6 checks passed — SSH comes back reliably.)
 
 - [ ] **REBOOT-04** — Test simultaneous reboot of both nodes. Reboot .228 and .198 at the same time. After both recover, verify: federation re-establishes, DWN sync works, file sharing works. **Acceptance**: Both nodes fully recover. Federation sync succeeds within 10 minutes of both being back.
 