feat: TASK-49 container reliability — tests, orchestration, MASTER_PLAN

- Add orchestration_tests.rs + mock_podman.rs (container unit tests)
- Add container-tests.yml CI workflow
- Add dev-container-test.sh for local testing
- MASTER_PLAN.md: add TASK-49 (P0) with 6-phase plan
- Login.vue: minor fixes from user testing
- AppCard.vue: enter key handler fix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -18,6 +18,7 @@
| **TASK-12** | **Beta telemetry — reporter + toggle + collector POST** | **P1** | IN PROGRESS | - |
| **TASK-39** | **Finish .198 rootless container migration** | **P1** | PLANNED | TASK-11 |
| **TASK-42** | **LUKS2 full-partition encryption for /var/lib/archipelago/** | **P1** | IN PROGRESS | - |
| **TASK-49** | **Container app reliability — bulletproof installs + recovery** | **P0** | PLANNED | - |
| **BUG-44** | **App iframe shows blank/broken when container is starting or crashed** | **P2** | PLANNED | - |
| **TASK-45** | **Deploy script: auto-chown data dirs after rootful→rootless migration** | **P2** | PLANNED | - |
| **BUG-46** | **FileBrowser missing in unbundled ISO + Cloud auto-login broken** | **P1** | IN PROGRESS | - |
@@ -149,6 +150,99 @@ Encrypt all Archipelago app data at rest using LUKS2 full-partition encryption.
- `core/archipelago/src/api/rpc/system.rs` — password change handler
- `core/archipelago/src/server.rs` — startup checks

### TASK-49: Container app reliability — bulletproof installs + recovery (PLANNED)
**Priority**: P0 — Critical
**Status**: PLANNED (2026-03-29)

Every marketplace app must install cleanly, survive failures, auto-recover from unhealthy states, and uninstall without residue. Currently: some apps fail silently, health checks are inconsistent, and there's no systematic testing.

**Scope**: All 25+ marketplace apps — install, health, restart, uninstall, dependency chains.

#### Phase A: Audit & Fix Install Flow (Days 1-2)
Test every app install on a fresh .198 node. Fix failures as found.

- [ ] **A1**: Create install test matrix — spreadsheet of all apps with columns: installs?, starts?, healthy?, UI loads?, uninstalls?, deps correct?
- [ ] **A2**: Test core apps: Bitcoin Knots, LND, Mempool, BTCPay, ElectrumX, FileBrowser
- [ ] **A3**: Test recommended apps: Fedimint, Vaultwarden, Grafana, SearXNG, Tailscale, Portainer
- [ ] **A4**: Test optional apps: Home Assistant, Jellyfin, PhotoPrism, Nextcloud, Ollama, Immich, Penpot, OnlyOffice
- [ ] **A5**: Test web-only/L484 apps: noStrudel, BotFights, NWNN, IndeedHub, DWN
- [ ] **A6**: Test Nostr relay (nostr-rs-relay) install + relay functionality
- [ ] **A7**: Fix all install failures found in A2-A6

#### Phase B: Health Checks & Restart Policies (Days 2-3)
Ensure every container has proper health checks and restart policies.

- [ ] **B1**: Audit all container manifests for `--health-cmd`, `--health-interval`, `--health-retries`
- [ ] **B2**: Add health checks to containers missing them (curl endpoint or process check)
- [ ] **B3**: Verify `--restart unless-stopped` on all containers
- [ ] **B4**: Test failure recovery: `podman kill <container>` → verify auto-restart
- [ ] **B5**: Test OOM recovery: set low memory limit → trigger OOM → verify restart
- [ ] **B6**: Verify container-doctor.sh runs on timer and fixes unhealthy containers
- [ ] **B7**: Verify reconcile-containers.sh detects and recreates missing containers
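
The B1 and B3 audits could be automated along the following lines. This is only a sketch: it assumes each manifest can be rendered as a `podman run` argument list, which the plan does not state, and the function names are hypothetical rather than taken from `core/archipelago/src/container/`.

```rust
/// Health flags that B1 asks every container to carry.
const REQUIRED_HEALTH_FLAGS: [&str; 3] =
    ["--health-cmd", "--health-interval", "--health-retries"];

/// True when `arg` is `flag` itself or the `flag=value` spelling.
fn has_flag(arg: &str, flag: &str) -> bool {
    arg == flag || arg.strip_prefix(flag).map_or(false, |rest| rest.starts_with('='))
}

/// Returns the required health flags that are absent from a `podman run`
/// argument list (hypothetical helper, assumed manifest representation).
pub fn missing_health_flags(args: &[&str]) -> Vec<&'static str> {
    REQUIRED_HEALTH_FLAGS
        .iter()
        .copied()
        .filter(|flag| !args.iter().any(|a| has_flag(a, flag)))
        .collect()
}

/// True when the arguments request `--restart unless-stopped` (B3),
/// in either the two-token or the `=` form.
pub fn restarts_unless_stopped(args: &[&str]) -> bool {
    args.iter().enumerate().any(|(i, a)| {
        *a == "--restart=unless-stopped"
            || (*a == "--restart" && args.get(i + 1) == Some(&"unless-stopped"))
    })
}
```

Running both checks over every manifest would give a first pass for B1 and B3 before any container is started. B4 through B7 still need a live node.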

#### Phase C: Dependency Chain Validation (Day 3)
Apps with dependencies (BTCPay→Bitcoin+Postgres, Mempool→Bitcoin+MariaDB) must handle missing deps gracefully.

- [ ] **C1**: Map all dependency chains (which app needs which)
- [ ] **C2**: Test installing dependent app without dependency → verify error message
- [ ] **C3**: Test stopping dependency while dependent is running → verify graceful degradation
- [ ] **C4**: Test restarting dependency → verify dependent reconnects automatically
- [ ] **C5**: Ensure backend `dependency_resolver.rs` handles all chains correctly
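
The check behind C1 and C2 might look like the sketch below: given a dependency map and the set of installed apps, list everything an install would still need. It is not the contents of `dependency_resolver.rs`. The `DepGraph` type, the function name, and the lowercase app identifiers are assumptions for illustration; only the two example chains come from the plan.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Maps an app to the apps it needs directly (identifiers are illustrative).
pub type DepGraph = BTreeMap<&'static str, Vec<&'static str>>;

/// Returns every direct or transitive dependency of `app` that is not
/// installed. An empty set means the install can proceed.
pub fn missing_deps(
    graph: &DepGraph,
    installed: &BTreeSet<&str>,
    app: &str,
) -> BTreeSet<&'static str> {
    let mut missing = BTreeSet::new();
    let mut seen: BTreeSet<&'static str> = BTreeSet::new();
    let mut stack: Vec<&str> = vec![app];

    // Depth-first walk; `seen` keeps the walk finite even if the map has a cycle.
    while let Some(current) = stack.pop() {
        for dep in graph.get(current).into_iter().flatten() {
            if seen.insert(*dep) {
                if !installed.contains(*dep) {
                    missing.insert(*dep);
                }
                stack.push(*dep);
            }
        }
    }
    missing
}
```

A non-empty result is what C2 expects to surface as an error message, ideally naming each missing dependency.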

#### Phase D: Uninstall & Cleanup (Day 4)
Every app must uninstall cleanly — no orphaned volumes, networks, or config.

- [ ] **D1**: Test uninstall for each app — verify container, volumes, config removed
- [ ] **D2**: Verify no orphaned podman volumes after uninstall (`podman volume ls`)
- [ ] **D3**: Verify no orphaned networks after uninstall
- [ ] **D4**: Test reinstall after uninstall — must work cleanly
- [ ] **D5**: Fix any cleanup issues found
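
One way to make D2 mechanical is to snapshot the volume list before install and after uninstall, then diff the two. The sketch below assumes the snapshots come from `podman volume ls --format "{{.Name}}"`, which prints one name per line. The helper names are hypothetical, and the same approach would apply to `podman network ls` for D3.

```rust
use std::collections::BTreeSet;

/// Parses one-name-per-line output, as produced by
/// `podman volume ls --format "{{.Name}}"`, ignoring blank lines.
pub fn parse_volume_names(output: &str) -> BTreeSet<String> {
    output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Volumes present after uninstall that did not exist before install.
/// Anything returned here is residue the uninstall left behind.
pub fn leftover_volumes(
    before_install: &BTreeSet<String>,
    after_uninstall: &BTreeSet<String>,
) -> Vec<String> {
    after_uninstall.difference(before_install).cloned().collect()
}
```

Diffing against a baseline avoids relying on any volume naming convention, which the plan does not specify.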

#### Phase E: Stress & Soak Testing (Day 5)
24-hour uptime test with all core and recommended apps running.

- [ ] **E1**: Install all core + recommended apps on .198
- [ ] **E2**: Let run for 24h — check for crashes, memory leaks, disk growth
- [ ] **E3**: Simulate power failure (hard reboot) — verify all apps come back
- [ ] **E4**: Simulate network failure — verify apps recover when network returns
- [ ] **E5**: Run container-doctor after soak test — should report all healthy

#### Phase E2: FileBrowser Auto-Login (Day 5)
FileBrowser must auto-login seamlessly after install — user should never see a separate login screen. Still protected via nginx session cookie validation.

- [ ] **E2a**: Fix FileBrowser auto-login flow: nginx auth_request validates Archipelago session, injects FileBrowser auth token
- [ ] **E2b**: Verify auto-login works on fresh bundled install (first boot)
- [ ] **E2c**: Verify auto-login works on unbundled install (Marketplace install)
- [ ] **E2d**: Verify FileBrowser is NOT accessible without valid Archipelago session (security)
- [ ] **E2e**: Test auto-login after session expiry → re-login to Archipelago → FileBrowser works again

#### Phase F: Frontend UX (Days 5-6)
The UI must accurately reflect container state at all times.

- [x] **F1**: Installing state persists across navigation (DONE — TASK-49 server store)
- [ ] **F2**: App card shows correct state: stopped, starting, running, unhealthy, crashed
- [ ] **F3**: App iframe shows contextual error when container is down (BUG-44)
- [ ] **F4**: Uninstall progress shown in My Apps
- [ ] **F5**: Error toast when install fails with actionable message
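
The five states named in F2 have to be derived from what the container runtime reports. The sketch below shows one possible backend-side mapping, assuming the inputs are the status string, the optional health status, and the exit code from a container inspection. The `CardState` enum and `card_state` function are hypothetical, and treating any non-zero exit as a crash is a simplification: a real implementation would probably also need to know whether the user asked for the stop.

```rust
/// The five app-card states listed in F2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CardState {
    Stopped,
    Starting,
    Running,
    Unhealthy,
    Crashed,
}

/// Derives a card state from container status, health status, and exit code.
/// Assumption: a container without a health check reports `None` for health
/// and is shown as Running while its status is "running".
pub fn card_state(status: &str, health: Option<&str>, exit_code: i32) -> CardState {
    match (status, health) {
        ("running", Some("unhealthy")) => CardState::Unhealthy,
        ("running", Some("starting")) => CardState::Starting,
        ("running", _) => CardState::Running,
        // Not running: a non-zero exit code is treated as a crash (simplification).
        _ if exit_code != 0 => CardState::Crashed,
        _ => CardState::Stopped,
    }
}
```

Keeping the mapping in one pure function would let the same rules back both the app card (F2) and the iframe error in F3.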

**Key files**:
- `core/archipelago/src/container/` — PodmanClient, manifests, health
- `core/archipelago/src/api/rpc/package/` — install/uninstall RPC handlers
- `scripts/container-doctor.sh` — health check + auto-fix
- `scripts/reconcile-containers.sh` — recreate missing containers
- `scripts/image-versions.sh` — pinned image versions
- `scripts/first-boot-containers.sh` — first-boot container creation
- `neode-ui/src/views/marketplace/` — install UI
- `neode-ui/src/views/apps/` — My Apps state display

**Testing approach**:
- Fresh .198 install as test bed
- SSH in, run installs via web UI, check with `podman ps -a`
- Automated: `scripts/container-doctor.sh --local` after each test
- Manual: kill containers, pull power, break networks, verify recovery

---

### BUG-44: App iframe shows blank/broken when container is starting or crashed (PLANNED)
**Priority**: P2 — Medium
**Status**: PLANNED (2026-03-21)