fix: overhaul container lifecycle — recovery, health, uninstall, UI state

Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00
parent 795e74bc50
commit 1e283daf13
65 changed files with 3950 additions and 298 deletions
--- a/docs/CONTAINER-ISSUES-REPORT.md
+++ b/docs/CONTAINER-ISSUES-REPORT.md
@@ -0,0 +1,508 @@
+# Archipelago Container Infrastructure — Critical Issues Report
+
+**Date:** 2026-03-31
+**Status:** Server .228 rebooted — some apps recovered, many did not. UI showed everything as "crashed" during recovery window.
+**Purpose:** Fix guide for getting container lifecycle to production quality.
+
+---
+
+## Executive Summary
+
+The container system has **7 systemic failures** that compound each other (detailed as Problems 1-9 below):
+
+1. **Silent failures everywhere** — errors are swallowed with `|| true`, `.unwrap_or_default()`, and warn-level logs. Nothing actually tells the user (or the system) that something broke.
+2. **Health checks are fake** — manifests define real health checks (HTTP probes, exec checks) but they are **never executed**. "Healthy" just means `podman ps` shows "running".
+3. **Duplicate polling burns CPU** — the health monitor and the metrics collector each run their own 60-second loop, and each spawns its own `podman stats`. Crash recovery snapshots, the disk monitor, and frontend polling add more, so subprocesses are spawned constantly.
+4. **Uninstall doesn't clean up** — no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
+5. **Two divergent install paths** — `first-boot-containers.sh` and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
+6. **UI misrepresents state** — `Exited` (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, UI shows a wall of red/gray "crashed" labels.
+7. **Dependency-blind restarts** — health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.
+
+---
+
+## LIVE EVIDENCE: .228 Reboot on 2026-03-31
+
+After rebooting .228, here's the actual container state 30 minutes later:
+
+### Permanently Dead (exceeded 3 restart attempts, abandoned)
+| Container | Exit Code | Cause |
+|-----------|-----------|-------|
+| `indeedhub-postgres` | 0 (clean) | Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too. |
+| `indeedhub-redis` | 0 | Same — clean exit, 3 failed restart attempts, abandoned |
+| `indeedhub-minio` | 0 | Same |
+| `indeedhub-relay` | 0 | Same |
+| `indeedhub` | 0 | Same |
+| `indeedhub-api` | 1 | Can't resolve hostname `indeedhub-postgres` (postgres is dead, DNS entry gone from network) |
+| `jellyfin` | 137 (SIGKILL, likely OOM) | "Failed to create CoreCLR": memory limit too low for the .NET runtime. Exit 137 = 128 + 9 (SIGKILL), most likely from the OOM killer. 3 attempts exhausted. |
+
+### Crash-Looping (still failing on every restart)
+| Container | Cause |
+|-----------|-------|
+| `mempool-api` | `ECONNREFUSED 10.89.0.42:3306` — DB (`archy-mempool-db`) just restarted, not ready yet |
+| `portainer` | "database schema version does not align with server version" — image upgraded, DB not migrated. Will NEVER recover. |
+| `photoprism` | "Failed creating test file in storage folder" — volume permission issue (rootless UID mapping) |
+
+### Never Started (stuck in "Created" state)
+| Container | Cause |
+|-----------|-------|
+| `archy-mempool-web` | "cannot assign requested address" — network binding failure |
+| `fedimint` | Same network error |
+
+### Running but Unhealthy
+| Container | Notes |
+|-----------|-------|
+| `homeassistant` | Up 14 min, health check failing |
+| `searxng` | Up 13 min, health check failing |
+| `onlyoffice` | Up 10 min, health check failing |
+
+### Actually Recovered (healthy)
+`filebrowser`, `bitcoin-knots`, `vaultwarden`, `nginx-proxy-manager`, `archy-btcpay-db`, `lnd`, `electrumx`, `grafana`
+
+### Key Observations
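+1. **All containers have `unless-stopped` restart policy**, but it did not bring them back. Containers shut down cleanly by the reboot (code 0) were not restarted by Podman, whose restart policies are not re-applied after a system reboot unless a boot-time unit such as `podman-restart.service` is enabled. The health monitor is therefore the only restart mechanism, and it gives up after 3 attempts.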
+2. **The entire IndeedHub stack died** because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. **No dependency awareness.**
+3. **Containers in "Created" state** were never even started — some kind of network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
+4. **The UI showed ALL apps as "crashed"** during the first few minutes, even the ones that eventually recovered. This is because `Exited` state (even exit code 0) maps to the label "crashed" in `appsConfig.ts`.
+
+---
+
+## Problem 1: Containers Don't Start or Recover After Reboot
+
+**Confirmed:** After the .228 reboot on 2026-03-31, every app initially showed as crashed, and many never recovered.
+
+### Root Causes
+
+#### A. Crash recovery has a 30-second timeout that's too short
+**File:** `core/archipelago/src/crash_recovery.rs:265-271`
+```rust
+let result = tokio::time::timeout(
+    std::time::Duration::from_secs(30),
+    tokio::process::Command::new("podman").args(["start", &record.name]).output(),
+).await;
+```
+On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is **skipped** — no retry.
+
+#### B. If `podman ps` itself times out, recovery finds zero containers
+**File:** `core/archipelago/src/crash_recovery.rs:318`
+The `podman ps -a` call that discovers stopped containers has a 30-second timeout. On a busy system post-reboot, this can time out. Result: `all_names` is empty, and recovery silently exits having started nothing.
+
+#### C. Boot tier ordering uses a catch-all that misses dependencies
+**File:** `core/archipelago/src/crash_recovery.rs:374-385`
+```rust
+fn container_boot_tier(name: &str) -> u8 {
+    match id {
+        "btcpay-db" | "mempool-db" | ... => 0,  // databases
+        "bitcoin-knots" | ... => 1,               // bitcoin
+        "lnd" | "electrumx" | ... => 2,           // depends on bitcoin
+        "mempool-web" | ... => 4,                  // frontend
+        _ => 3,  // EVERYTHING ELSE - may start before its dependencies
+    }
+}
+```
+Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.
+
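One way to avoid the catch-all is to compute tiers from declared dependencies instead of a hand-maintained list. The following is a minimal sketch, not the code in `crash_recovery.rs`: `boot_tiers` and the shape of the `deps` map are hypothetical, and it assumes each container's dependencies can be read from its manifest.

```rust
use std::collections::HashMap;

/// Boot tier = 0 for a container with no dependencies, otherwise
/// 1 + the highest tier among its dependencies.
/// Returns Err on a dependency cycle or on an undeclared container.
fn boot_tiers(deps: &HashMap<&str, Vec<&str>>) -> Result<HashMap<String, u8>, String> {
    fn visit(
        name: &str,
        deps: &HashMap<&str, Vec<&str>>,
        tiers: &mut HashMap<String, u8>,
        in_progress: &mut Vec<String>,
    ) -> Result<u8, String> {
        if let Some(&tier) = tiers.get(name) {
            return Ok(tier); // already computed
        }
        if in_progress.iter().any(|n| n == name) {
            return Err(format!("dependency cycle through {}", name));
        }
        let children = deps
            .get(name)
            .ok_or_else(|| format!("{} is not declared", name))?;
        in_progress.push(name.to_string());
        let mut tier = 0u8;
        for dep in children {
            tier = tier.max(visit(dep, deps, tiers, in_progress)? + 1);
        }
        in_progress.pop();
        tiers.insert(name.to_string(), tier);
        Ok(tier)
    }

    let mut tiers = HashMap::new();
    let mut in_progress = Vec::new();
    for name in deps.keys() {
        visit(name, deps, &mut tiers, &mut in_progress)?;
    }
    Ok(tiers)
}
```

With `indeedhub-api` declared as depending on `indeedhub-postgres`, postgres lands in tier 0 and the API in tier 1. A container that is referenced but never declared becomes an error rather than a silent tier 3.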
+#### D. First-boot script swallows ALL errors
+**File:** `scripts/first-boot-containers.sh:8` — no `set -e`
+48+ commands have `|| true` appended. Every `podman run` failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.
+
+#### E. Install RPC returns success before container is actually running
+**File:** `core/archipelago/src/api/rpc/package/install.rs:260-294`
+After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:
+```rust
+if i == 5 {
+    debug!("Container {} health check timeout (30s) -- continuing anyway");
+}
+```
+It logs at debug level and **returns success**. The user sees "installed" but the container never actually started.
+
+### Fixes Required
+
+1. **Increase crash recovery timeout to 120s** and add retry with backoff (3 attempts per container)
+2. **Increase `podman ps` timeout to 60s** during boot recovery
+3. **Replace tier catch-all** — every container must be explicitly listed or derived from manifest dependencies
+4. **Remove `|| true`** from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
+5. **Install RPC must return failure** if container isn't running after timeout, not silently succeed
+6. **Add `--restart unless-stopped`** to container creation in the Podman client (`core/container/src/podman_client.rs:303-335`) — currently missing, so Podman itself never auto-restarts crashed containers
+
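Fix 1 could look roughly like the following. This is a synchronous, std-only sketch so that it stays self-contained: `retry_with_backoff` is a hypothetical helper, the real recovery code is async (tokio), and the x3 multiplier is an assumption borrowed from the 10s/30s/90s schedule quoted for the health monitor.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` up to `max_attempts` times, tripling the delay after each
/// failed attempt. Returns the first success or the last error.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the error instead of skipping silently.
            Err(err) if attempt == max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 3;
                attempt += 1;
            }
        }
    }
}
```

The closure would wrap the `podman start` call and its 120s timeout. After the last failed attempt the error reaches the caller, so the container can be reported as failed rather than skipped.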
+---
+
+## Problem 2: Health Checks Are Fake
+
+### Root Causes
+
+#### A. "Healthy" just means "running" — application health is never checked
+**File:** `core/archipelago/src/container/dev_orchestrator.rs:239-249`
+```rust
+pub async fn get_health_status(&self, app_id: &str) -> Result<String> {
+    match status.state {
+        ContainerState::Running => Ok("healthy".to_string()),  // <-- THIS IS THE ENTIRE CHECK
+        ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
+        ...
+    }
+}
+```
+A container can be "running" but the application inside is completely broken. This is reported as "healthy".
+
+#### B. Manifest health checks exist but are never executed
+All 30+ app manifests in `image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml` define health checks like:
+```yaml
+health_check:
+  type: http
+  endpoint: http://localhost:4080
+  path: /api/health
+  interval: 30s
+  timeout: 5s
+  retries: 3
+```
+The `HealthMonitor` struct at `core/container/src/health_monitor.rs` can execute these checks. **But it is never instantiated.** No code path creates a `HealthMonitor` from the manifest health check definitions.
+
+#### C. Health status is never pushed to the frontend via WebSocket
+**File:** `core/archipelago/src/data_model.rs:120-127`
+```rust
+pub struct PackageDataEntry {
+    pub health: Option<String>,  // Field exists but is NEVER POPULATED
+}
+```
+The health field in the data model is always `None`. Frontend can only get health via explicit RPC call, which it almost never makes.
+
+#### D. Frontend never polls health status
+**File:** `neode-ui/src/stores/container.ts:169-175`
+`fetchHealthStatus()` is only called after `startContainer()` and `startBundledApp()`. There is **no setInterval, no periodic polling, no watch**. After the initial call, health status is never refreshed.
+
+### Fixes Required
+
+1. **Wire up manifest health checks** — instantiate `HealthMonitor` from manifest definitions, run actual HTTP/exec probes instead of just checking `podman ps`
+2. **Populate the `health` field in `PackageDataEntry`** so WebSocket pushes real health status to frontend
+3. **Add 30-second health polling** in the frontend container store (with backoff to 60s when all healthy)
+4. **Fix `get_health_status()`** in dev_orchestrator to call actual health checks, not just check container state
+
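To make the difference between "running" and "healthy" concrete, here is a minimal sketch of how probe results could drive the reported status. `Health` and `HealthTracker` are hypothetical names, not the existing `HealthMonitor` API, and the sketch assumes the manifest's `retries: 3` means "unhealthy after three consecutive failed probes".

```rust
/// Health derived from probe results rather than from container state alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Healthy,
    Degraded,  // probe failing, but still below the retry threshold
    Unhealthy, // `retries` consecutive probe failures
}

/// Tracks consecutive probe failures for one container.
struct HealthTracker {
    retries: u32,
    consecutive_failures: u32,
}

impl HealthTracker {
    fn new(retries: u32) -> Self {
        Self { retries, consecutive_failures: 0 }
    }

    /// Records one probe result (for example the outcome of the manifest's
    /// HTTP check) and returns the resulting health.
    fn record(&mut self, probe_ok: bool) -> Health {
        if probe_ok {
            self.consecutive_failures = 0;
            return Health::Healthy;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.retries {
            Health::Unhealthy
        } else {
            Health::Degraded
        }
    }
}
```

The value returned by `record` is what would populate the `health` field of `PackageDataEntry`, so the WebSocket push carries probe-based status instead of `None`.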
+---
+
+## Problem 3: CPU Exhaustion from Duplicate Polling
+
+### Root Causes
+
+#### A. Two independent monitors both poll every 60 seconds and both call `podman stats`
+- **Health monitor:** `core/archipelago/src/health_monitor.rs:17` — `CHECK_INTERVAL_SECS = 60`
+  - Runs `podman ps -a --format json` (line 305-323)
+  - Runs `podman stats --no-stream` every 5 cycles (line 442-450)
+- **Metrics collector:** `core/archipelago/src/monitoring/mod.rs:28` — 60-second interval
+  - Runs `podman stats --no-stream --format json` independently (collector.rs:220-224)
+
+These are **not coordinated**. Both spawn separate subprocesses. On a system with 15+ containers, each `podman stats` call is expensive.
+
+#### B. Total subprocess spawning frequency
+| Component | Interval | What it runs |
+|-----------|----------|-------------|
+| Health monitor | 60s | `podman ps`, `podman stats` (every 5th), restart attempts |
+| Metrics collector | 60s | `podman stats` (duplicate!) |
+| Crash recovery snapshot | 120s | `podman ps` |
+| Disk monitor | 300s | `df`, `sudo dmesg`, potentially `podman image prune` |
+| Telemetry | 900s | `podman stats` (another duplicate) |
+| Systemd watchdog | 120s | sd_notify ping |
+| Frontend fleet polling | 60s | RPC calls that trigger more podman commands |
+
+That's roughly **one `podman` subprocess every 10-15 seconds** on average, plus all the triggered operations.
+
+#### C. No restart policy means polling-driven restarts
+**File:** `core/container/src/podman_client.rs:303-335`
+Container creation spec does NOT include `RestartPolicy`. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.
+
+#### D. Health monitor restart attempts with exponential backoff still spawn processes
+When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns `podman start`, `podman inspect`, etc. If multiple containers are unhealthy, this multiplies.
+
+### Fixes Required
+
+1. **Deduplicate `podman stats`** — create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
+2. **Add a restart policy to all container creation** so Podman handles restarts natively instead of relying on polling: `unless-stopped`, or `on-failure` with a retry cap of 5 (Podman applies a retry count only to `on-failure`)
+3. **Increase health monitor interval to 120s** (60s is too aggressive when health checks are just `podman ps`)
+4. **Remove duplicate `podman stats`** call from metrics collector — share data with health monitor
+5. **Make frontend fleet polling viewport-aware** — only poll when user is actually viewing the fleet page
+6. **Batch all container queries** — use a single `podman ps -a --format json` per check cycle, shared across all consumers
+
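The shared cache in fix 1 can be small. The sketch below is one possible shape under stated assumptions: `TtlCache` is a hypothetical type, the caller passes `now` explicitly to keep it testable, and a real version shared between the health monitor and metrics collector would need to sit behind a mutex.

```rust
use std::time::{Duration, Instant};

/// Caches one value for `ttl`.
struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Returns the cached value if it is younger than `ttl`; otherwise calls
    /// `fetch` (e.g. one `podman stats --no-stream` run) and caches the result.
    fn get_or_refresh<E>(
        &mut self,
        now: Instant,
        fetch: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        if let Some((fetched_at, value)) = &self.entry {
            if now.duration_since(*fetched_at) < self.ttl {
                return Ok(value.clone());
            }
        }
        // Stale or empty: fetch once and remember when we did.
        let value = fetch()?;
        self.entry = Some((now, value.clone()));
        Ok(value)
    }
}
```

With a 30s TTL, two consumers polling on 60s loops that are offset by a few seconds trigger one subprocess per cycle instead of two. A failed fetch is returned to the caller and leaves the previous entry untouched.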
+---
+
+## Problem 4: Uninstall Doesn't Work
+
+### Root Causes
+
+#### A. No volume removal
+**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`
+The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It **never removes Podman volumes**. Orphaned volumes accumulate forever.
+
+#### B. No network cleanup
+**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`
+Multi-container stacks create networks (`archy-net`, `immich-net`, `penpot-net`) during install (`stacks.rs:89, 211`). These are **never cleaned up** during uninstall. Leftover networks can prevent reinstallation.
+
+#### C. Force-kills stateful containers without graceful shutdown
+**File:** `core/archipelago/src/api/rpc/package/runtime.rs:226`
+```rust
+let rm_out = tokio::process::Command::new("podman")
+    .args(["rm", "-f", name])  // -f = force kill
+    .output().await;
+```
+The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for `stop`. The `rm -f` that follows **ignores these timeouts** and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.
+
+#### D. Returns 200 OK even on partial failure
+**File:** `core/archipelago/src/api/rpc/package/runtime.rs:268-289`
+```rust
+Ok(serde_json::json!({
+    "status": if errors.is_empty() { "uninstalled" } else { "partial" },
+    ...
+}))
+```
+Returns HTTP 200 with `"partial"` status. Frontend at `neode-ui/src/views/apps/useAppsActions.ts:74` doesn't check for "partial" — it deletes the app from the UI regardless.
+
+#### E. Data directory cleanup requires sudo and fails silently
+**File:** `core/archipelago/src/api/rpc/package/runtime.rs:256-265`
+```rust
+let rm_out = tokio::process::Command::new("sudo")
+    .args(["rm", "-rf", dir]).output().await;
+if let Ok(o) = rm_out {
+    if !o.status.success() {
+        tracing::warn!(...);  // Warning only, continues
+    }
+}
+```
+If sudo isn't configured or fails, data remains on disk but UI shows "uninstalled".
+
+#### F. Container name detection has gaps
+**File:** `core/archipelago/src/api/rpc/package/config.rs:287-340`
+Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.
+
+### Fixes Required
+
+1. **Add `podman volume rm`** for all volumes associated with the app after container removal
+2. **Add network cleanup** — remove app-specific networks after all containers on that network are gone
+3. **Use `podman stop -t {timeout}` then `podman rm`** (without -f) — respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
+4. **Return an error (not 200)** when uninstall has failures. Frontend must check and display errors
+5. **Surface "partial" failures to the user** with specific error messages
+6. **Unify container naming** — derive names from a single source (manifest), not hardcoded patterns in multiple files
+
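Fixes 1-4 amount to a fixed command order plus an honest return value. The following sketch only builds the `podman` argument lists and does not run them. `uninstall_plan`, `stop_timeout_secs`, and `finish_uninstall` are hypothetical names; the 600s/330s/120s timeouts are the ones quoted in this report, while the name matching and the 30s fallback are assumptions.

```rust
/// One podman invocation, as an argument vector.
type Command = Vec<String>;

fn podman_args(args: &[&str]) -> Command {
    args.iter().map(|a| a.to_string()).collect()
}

/// Graceful stop timeout in seconds (name matching is an assumption).
fn stop_timeout_secs(container: &str) -> u32 {
    if container.contains("bitcoin") {
        600
    } else if container.contains("lnd") {
        330
    } else if container.ends_with("-db") || container.contains("postgres") {
        120
    } else {
        30 // assumed default for stateless containers
    }
}

/// Ordered commands for removing an app: graceful stop, plain `rm` (no `-f`),
/// then volumes, then networks.
fn uninstall_plan(containers: &[&str], volumes: &[&str], networks: &[&str]) -> Vec<Command> {
    let mut plan = Vec::new();
    for &c in containers {
        let timeout = stop_timeout_secs(c).to_string();
        plan.push(podman_args(&["stop", "-t", timeout.as_str(), c]));
        plan.push(podman_args(&["rm", c]));
    }
    for &v in volumes {
        plan.push(podman_args(&["volume", "rm", v]));
    }
    for &n in networks {
        plan.push(podman_args(&["network", "rm", n]));
    }
    plan
}

/// Turns collected per-step errors into a real failure instead of "partial".
fn finish_uninstall(errors: Vec<String>) -> Result<(), String> {
    if errors.is_empty() {
        Ok(())
    } else {
        Err(format!("uninstall incomplete: {}", errors.join("; ")))
    }
}
```

Returning `Err` from `finish_uninstall` lets the RPC layer respond with an error, so the frontend cannot delete the app from the UI while data or containers remain.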
+---
+
+## Problem 5: Two Divergent Install Paths
+
+The first-boot bash script and the Rust RPC installer create containers with **different configurations**. This is a major source of bugs.
+
+### Specific Divergences
+
+#### A. Database passwords
+- **First-boot** (`scripts/first-boot-containers.sh:118-127`): Generates random passwords with `openssl rand -base64 24`, stores in `/var/lib/archipelago/secrets/`
+- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610`): Uses hardcoded `"btcpaypass"`, `"mempoolpass"`, `"rootpass"`, `"immichpass"`
+
+**Result:** Apps installed via RPC after first-boot can't connect to databases because passwords don't match.
+
+#### B. Bitcoin configuration
+- **First-boot** (`scripts/first-boot-containers.sh:295-313`): Dynamically sets `-prune=550` on small disks, `-txindex=1` on large disks
+- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:415-420`): No custom args at all
+
+**Result:** Bitcoin installed via RPC has no pruning or txindex regardless of disk size.
+
+#### C. ZMQ configuration for LND
+- **First-boot** (`scripts/first-boot-containers.sh:100-114`): `bitcoin.conf` generated without ZMQ publisher settings
+- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:438-439`): LND configured to connect to `tcp://bitcoin-knots:28332` and `tcp://bitcoin-knots:28333`
+
+**Result:** LND can't receive block notifications from Bitcoin: LND expects ZMQ on ports 28332/28333, but neither path enables Bitcoin's ZMQ publisher.
+
+#### D. Port conflicts
+- **First-boot** (`scripts/first-boot-containers.sh:813,835`): Both strfry and indeedhub bind to host port 7777
+- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:734`): IndeedHub uses `8190:3000`
+
+**Result:** On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, different port entirely.
+
+#### E. Memory limits
+- **First-boot** (`scripts/first-boot-containers.sh:253-283`): Ollama gets 1g on low-mem systems
+- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:245-280`): Ollama gets 4g always
+
+**Result:** Same app gets different resource limits depending on how it was installed.
+
+#### F. Version mismatches in marketplace UI
+- `scripts/image-versions.sh:17`: LND image is `v0.18.4-beta`
+- `neode-ui/src/views/marketplace/marketplaceData.ts:155`: Shows `0.17.4`
+- `scripts/image-versions.sh:21-22`: Mempool images are `v3.0.0`
+- `neode-ui/src/views/marketplace/marketplaceData.ts:177`: Shows `2.5.0`
+
+### Fixes Required
+
+1. **Single source of truth for container config** — Rust config must read passwords from `/var/lib/archipelago/secrets/`, not hardcode them
+2. **Add ZMQ config** to Bitcoin startup in both paths: `zmqpubrawblock=tcp://0.0.0.0:28332` and `zmqpubrawtx=tcp://0.0.0.0:28333`
+3. **Fix port 7777 conflict** — assign unique ports to strfry and indeedhub
+4. **Add disk-aware Bitcoin config** to Rust installer (prune/txindex based on disk size)
+5. **Sync memory limits** between first-boot and Rust config
+6. **Update marketplace version strings** to match actual image versions in `image-versions.sh`
+7. **Long-term: eliminate first-boot-containers.sh** — have the backend handle all container creation using the same Rust code path
+
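For fix 1, reading a secret from disk is a few lines. This is a sketch under assumptions: `read_secret` is a hypothetical helper, and the one-file-per-secret layout inside `/var/lib/archipelago/secrets/` is assumed rather than taken from the first-boot script.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads one secret from `dir` (e.g. /var/lib/archipelago/secrets) and trims
/// surrounding whitespace. A missing or empty file is an error, so the caller
/// never falls back to a hardcoded password.
fn read_secret(dir: &Path, name: &str) -> io::Result<String> {
    let raw = fs::read_to_string(dir.join(name))?;
    let secret = raw.trim();
    if secret.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("secret {} is empty", name),
        ));
    }
    Ok(secret.to_string())
}
```

Failing loudly matters here: a silent fallback to a default such as `"mempoolpass"` would recreate the password mismatch described in section A.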
+---
+
+## Problem 6: Post-Install Hooks Run Async and Fail Silently
+
+**File:** `core/archipelago/src/api/rpc/package/install.rs:541-625`
+
+Post-install hooks (setting FileBrowser password, configuring Nextcloud, etc.) are spawned as background tasks:
+```rust
+tokio::spawn(async move {
+    let _ = tokio::fs::create_dir_all(secret_dir).await;
+    let _ = tokio::fs::write(...).await;
+});
+```
+
+The install RPC returns success **before hooks complete**. If a hook fails (network timeout, service not ready), the error is discarded (`let _ =`) or at best logged, and the user is still told installation succeeded. Credentials aren't set and configs aren't applied.
+
+### Fix Required
+
+Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.
+
+---
+
+## Problem 7: Podman Client Swallows Errors
+
+**File:** `core/container/src/podman_client.rs`
+
+#### A. JSON serialization failures return empty strings (line 182-183)
+```rust
+let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();
+```
+
+#### B. Container ID parsing failures return empty string (line 344-348)
+```rust
+let id = result["Id"].as_str().unwrap_or("").to_string();
+Ok(id)  // Empty string = success?
+```
+
+#### C. Socket timeout is only 5 seconds (line 154-160)
+On a busy system or during boot, the Podman socket may take >5s to respond, in which case every API call fails. There is no retry logic.
+
+### Fixes Required
+
+1. Replace `.unwrap_or_default()` with proper error propagation using `?`
+2. Return `Err` when container ID is empty
+3. Increase socket timeout to 15-30s
+4. Add retry with backoff (3 attempts) on socket connection
+
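Fixes 1 and 2 could be sketched as follows. To stay dependency-free, `id_field` stands in for the `result["Id"].as_str()` expression from excerpt B; `CreateError` and `container_id` are hypothetical names, not part of `podman_client.rs`.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum CreateError {
    MissingId,
    EmptyId,
}

impl fmt::Display for CreateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CreateError::MissingId => write!(f, "create response has no \"Id\" field"),
            CreateError::EmptyId => write!(f, "create response has an empty \"Id\""),
        }
    }
}

impl std::error::Error for CreateError {}

/// Returns the container ID, or an error when it is absent or empty,
/// instead of mapping both cases to `Ok("")`.
fn container_id(id_field: Option<&str>) -> Result<String, CreateError> {
    let id = id_field.ok_or(CreateError::MissingId)?;
    if id.is_empty() {
        return Err(CreateError::EmptyId);
    }
    Ok(id.to_string())
}
```

Because `CreateError` implements `std::error::Error`, callers can propagate it with `?` alongside other error types.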
+---
+
+## Problem 8: UI Misrepresents Container State
+
+### Root Causes
+
+#### A. "Exited" always displays as "Crashed" — even for clean shutdowns
+**File:** `neode-ui/src/views/apps/appsConfig.ts:119-146`
+```typescript
+// Paraphrased mapping inside getStatusLabel(state, health):
+//   "exited" → "crashed"     // <-- THIS IS THE PROBLEM
+```
+Every container that exited — whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1) — shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up.
+
+#### B. No "recovering" or "boot in progress" state exists
+**File:** `core/archipelago/src/data_model.rs:103-119`
+PackageState enum has `Starting`, but it's only set during **explicit user start actions**, not during automatic crash recovery. During boot recovery, containers transition from `Exited → Running` without ever passing through `Starting`, so the UI never shows a spinner or "starting up" message.
+
+#### C. Backend skips sub-containers from package listing, so their state is invisible
+**File:** `core/archipelago/src/container/docker_packages.rs:39-117`
+The excluded_services list filters out backend services like `mempool-db`, `btcpay-db`, `nbxplorer`, `penpot-postgres`, etc. UI containers ending in `-ui` are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., `indeedhub-postgres` being dead kills the entire IndeedHub stack, but only `indeedhub-api` errors are visible).
+
+#### D. No distinction between "needs manual intervention" and "will recover soon"
+The UI shows the same visual treatment for:
+- Portainer (DB migration error — will NEVER recover without manual intervention)
+- mempool-api (DB not ready yet — will recover in 30 seconds)
+- IndeedHub (dependencies abandoned — won't recover until deps are manually restarted)
+
+### Fixes Required
+
+1. **Differentiate exit codes**: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning)
+2. **Add a "recovering" state**: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
+3. **Show sub-container health**: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
+4. **Distinguish recoverable from permanent failures**: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
+5. **Add recovery progress indicator**: During boot, show "Recovering containers: 15/22 started" on the dashboard
+
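Fixes 1, 2, and 4 reduce to a small decision table. The real label function lives in `appsConfig.ts`; the logic is shown here in Rust only to match the other sketches in this report. `status_label` and its parameters are hypothetical, and the label strings are the ones proposed above.

```rust
/// Label shown for a container. `exit_code` is only meaningful for exited
/// containers, `recovering` is true during the boot/crash recovery window,
/// and `gave_up` is true once the health monitor has stopped retrying.
fn status_label(
    state: &str,
    exit_code: Option<i32>,
    recovering: bool,
    gave_up: bool,
) -> &'static str {
    match (state, exit_code) {
        ("running", _) => "running",
        ("created", _) | ("exited", _) if recovering => "Starting up...",
        ("exited", _) if gave_up => "Needs attention",
        ("exited", Some(0)) => "stopped",
        ("exited", Some(137)) => "killed (OOM)",
        ("exited", _) => "crashed",
        _ => "unknown",
    }
}
```

The arm order encodes the priority: the recovery window wins over everything except `running`, and "Needs attention" wins over the exit-code labels once retries are exhausted.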
+---
+
+## Problem 9: Dependency-Blind Restarts
+
+### Root Cause (Confirmed by .228 reboot)
+
+The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:
+
+1. `indeedhub-postgres` exits cleanly (code 0) on reboot
+2. Health monitor restarts postgres — it starts, but exits again (likely needs volume mount or network ready)
+3. After 3 attempts, postgres is **abandoned**
+4. Meanwhile, `indeedhub-api` tries to connect to postgres → `ENOTFOUND indeedhub-postgres` → exits
+5. Health monitor restarts api → same DNS failure → exits
+6. After 3 attempts, api is **abandoned**
+7. Same cascade for redis, minio, relay, main container — all abandoned within minutes
+
+**File:** `core/archipelago/src/health_monitor.rs:500-530`
+The restart loop treats each container independently. There's no logic to:
+- Check if a container's dependencies are running before restarting it
+- Restart dependencies first when a dependent container fails
+- Reset attempt counters when a dependency comes back online
+
+**3 attempts is too few**, especially when dependencies need time:
+- Attempt 1: 10s backoff → dependency still starting
+- Attempt 2: 30s backoff → dependency crashed and is being restarted
+- Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
+- Game over. Entire stack is dead.
+
+### Fixes Required
+
+1. **Dependency-aware restart ordering**: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
+2. **Increase max restart attempts to 5-10** for containers with dependencies
+3. **Reset attempt counters** when a dependency comes back online (the dependent container failed because of the dependency, not itself)
+4. **Add a "stack restart" concept**: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
+5. **Handle "Created" state containers**: `archy-mempool-web` and `fedimint` are in "Created" state (never started). The health monitor should detect these and attempt to start them.
+
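Fixes 1 and 3 can be sketched as two small functions over a dependency map. This is an illustration rather than the logic in `health_monitor.rs`: `restart_plan`, `reset_dependents`, and the shape of `deps` are hypothetical, and the dependency edges would have to come from the manifests.

```rust
use std::collections::{HashMap, HashSet};

/// Returns the containers to start, dependencies first, so that `target` is
/// only started after everything it depends on. Running containers are skipped.
fn restart_plan(
    target: &str,
    deps: &HashMap<&str, Vec<&str>>,
    running: &HashSet<&str>,
) -> Vec<String> {
    fn visit(
        name: &str,
        deps: &HashMap<&str, Vec<&str>>,
        running: &HashSet<&str>,
        seen: &mut HashSet<String>,
        plan: &mut Vec<String>,
    ) {
        // Skip containers that are up, and guard against dependency cycles.
        if running.contains(name) || !seen.insert(name.to_string()) {
            return;
        }
        for dep in deps.get(name).map(|d| d.as_slice()).unwrap_or(&[]) {
            visit(dep, deps, running, seen, plan);
        }
        plan.push(name.to_string());
    }

    let mut plan = Vec::new();
    visit(target, deps, running, &mut HashSet::new(), &mut plan);
    plan
}

/// Clears the restart counters of everything that directly depends on `recovered`.
fn reset_dependents(
    recovered: &str,
    deps: &HashMap<&str, Vec<&str>>,
    attempts: &mut HashMap<String, u32>,
) {
    for (name, needs) in deps {
        if needs.contains(&recovered) {
            attempts.insert(name.to_string(), 0);
        }
    }
}
```

In the IndeedHub case, a failed `indeedhub-api` would yield the plan `indeedhub-postgres`, then `indeedhub-api`, and postgres coming back would reset the API's counter instead of leaving it one attempt from abandonment.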
+---
+
+## Priority Order for Fixes
+
+### P0 — System is broken without these (reboot = broken system)
+1. **Dependency-aware restarts** in health_monitor.rs — restart dependencies before dependents, reset attempt counters when deps recover
+2. **Increase max restart attempts to 10** (currently 3) — dependency chains need more time on boot
+3. **Handle "Created" state** — containers stuck in Created are never started by health monitor
+4. **Fix UI state labels** — "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
+5. Fix Rust config to read secrets from `/var/lib/archipelago/secrets/` instead of hardcoded passwords
+6. Fix port 7777 conflict (strfry vs indeedhub)
+7. Add ZMQ config to Bitcoin for LND block notifications
+
+### P1 — Core functionality broken
+8. Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
+9. Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
+10. Return actual errors from install/uninstall instead of silent success on partial failure
+11. Remove `|| true` from critical first-boot commands
+12. Show sub-container health in UI (which dependency is actually broken)
+
+### P2 — Performance and CPU
+13. Deduplicate `podman stats` calls (health monitor and metrics collector each spawn their own, on independent 60s loops)
+14. Increase health monitor interval to 120s
+15. Push health status to the frontend over WebSocket (populate the `health` field in the data model)
+16. Make fleet polling viewport-aware (don't poll when user isn't viewing)
+
+### P3 — Consistency and correctness
+17. Sync memory limits between first-boot and Rust config
+18. Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
+19. Unify container naming conventions between first-boot script and Rust config
+20. Add disk-aware Bitcoin config (prune/txindex) to Rust installer
+21. Distinguish "needs manual intervention" from "will recover soon" in UI
+
+---
+
+## Key Files to Modify
+
+| File | What to fix |
+|------|-------------|
+| `core/archipelago/src/health_monitor.rs` | Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector |
+| `core/container/src/podman_client.rs` | Add RestartPolicy to container creation spec, fix `.unwrap_or_default()` error swallowing, increase socket timeout to 15-30s |
+| `core/archipelago/src/crash_recovery.rs` | Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all |
+| `core/archipelago/src/api/rpc/package/install.rs` | Return failure on timeout (not silent success), await post-install hooks |
+| `core/archipelago/src/api/rpc/package/runtime.rs` | Add volume/network cleanup on uninstall, use `podman stop -t` then `podman rm` (not `-f`), return errors on partial failure |
+| `core/archipelago/src/api/rpc/package/config.rs` | Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits |
+| `core/archipelago/src/container/dev_orchestrator.rs` | Wire up manifest-defined health checks instead of just checking podman state |
+| `core/archipelago/src/container/docker_packages.rs` | Stop filtering sub-containers from state — or expose their health as part of parent app status |
+| `core/archipelago/src/data_model.rs` | Populate `health` field for WebSocket push, add exit code to state |
+| `core/archipelago/src/monitoring/mod.rs` | Share podman stats data with health monitor instead of duplicate subprocess calls |
+| `neode-ui/src/views/apps/appsConfig.ts` | Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window |
+| `neode-ui/src/stores/container.ts` | Add periodic health polling (30s) |
+| `neode-ui/src/views/apps/useAppsActions.ts` | Check for "partial" uninstall status, show errors to user |
+| `neode-ui/src/views/marketplace/marketplaceData.ts` | Fix version strings to match image-versions.sh |
+| `scripts/first-boot-containers.sh` | Remove `\|\| true` from critical commands, fix port 7777 conflict, add proper error reporting |
--- a/docs/GAMEPAD-NAV.md
+++ b/docs/GAMEPAD-NAV.md
@@ -0,0 +1,159 @@
+# Gamepad / Controller Navigation Map
+
+## Global Controls
+
+| Button | Action |
+|--------|--------|
+| D-pad Up/Down | Navigate between items |
+| D-pad Left | Go to sidebar (from any page) |
+| D-pad Right | Enter main content from sidebar |
+| Enter (A) | Activate / click focused element |
+| Escape (B) | Go back one level (inner → container → sidebar → detail page back) |
+
+## Navigation Layers
+
+```
+SIDEBAR ──Right──► CONTAINERS (or NAV BAR) ──Enter──► INNER CONTROLS
+   ▲                      ▲                                  │
+   └──Escape──────────────┘◄─────────Escape──────────────────┘
+```
+
+### Sidebar
+- **Up/Down**: Move between sidebar items (wraps around); the focused link is navigated to automatically
+- **Right**: Jump to main content (first container, or first button on container-free pages)
+- **Left**: Nothing
+
+### Nav Bar (mode-switcher tabs, category buttons)
+- **Left/Right**: Move between tabs
+- **Down**: Jump to first container below (remembers which tab for Up return)
+- **Up**: Nothing (Escape to go to sidebar)
+- **Left from leftmost**: Go to sidebar
+
+### Container Grid (card tiles on most pages)
+- **Arrows**: Spatial nav between containers
+- **Enter**: Activate primary action (Install/Launch/navigate) or enter inner controls
+- **Escape**: Go to sidebar
+- **Left from leftmost**: Go to sidebar
+- **Up from top row**: Return to remembered nav bar tab, or spatial to nearest nav item
+
+### Inside Container (inner buttons after Enter)
+- **Arrows**: Move between inner controls
+- **Escape**: Exit back to the container tile
+
+### Text Inputs
+- **Up/Down**: Exit field, navigate spatially
+- **Enter**: Submit (click adjacent button)
+- **Left/Right**: Cursor movement (exit at edges)
+
+### Container-Free Pages (Settings)
+- **Right from sidebar**: Focus first button immediately (no 1s poll delay)
+- **Up/Down**: Linear navigation through all buttons/toggles
+- **Left**: Go to sidebar
+- **Escape**: Go to sidebar
+
+---
+
+## Per-Page Mappings
+
+### Home (`/dashboard`)
+Container grid. Dashboard info cards.
+
+### My Apps (`/dashboard/apps`)
+| # | Element | Type |
+|---|---------|------|
+| Nav | My Apps / App Store / Services tabs | Nav bar (Left/Right) |
+| 1–N | App cards (grid) | Containers — Enter to view details, inner Launch/Stop/Restart buttons |
+
+### App Store / Discover (`/dashboard/discover`)
+| # | Element | Type |
+|---|---------|------|
+| Nav | My Apps / App Store / Services tabs | Nav bar (Left/Right) |
+| 1–2 | Sovereignty Stack featured cards | Containers (`glass-card transition-all hover:-translate-y-1`) |
+| 3–N | All Applications grid cards | Containers — Enter for details, inner Install/Launch buttons |
+
+### Network (`/dashboard/server`)
+| # | Element | Type |
+|---|---------|------|
+| 1 | Quick Actions card | Single container — Enter to access Restart/Check Tor/View Logs buttons |
+| 2 | Local Network card | Container |
+| 3 | Web3 card | Container |
+| 4 | Network Interfaces card | Container |
+| 5 | Tor Services card | Container |
+
+### Mesh (`/dashboard/mesh`)
+| # | Element | Type |
+|---|---------|------|
+| 1 | Device status card | Container (left column) |
+| 2 | Actions row (Enable/Broadcast/Off-Grid/Refresh) | Container |
+| 3 | Peers list card | Container — Enter peer to open chat, inner peer items navigable |
+| 4 | Chat panel | Container (right column) — message input + send |
+| 5+ | Tool panels (Bitcoin/Dead Man/Map) | Containers |
+
+**Chat flow**: Select peer (Enter) → focus auto-jumps to message input → type → Enter sends.
+
+### Cloud (`/dashboard/cloud`)
+Container grid. Folder/file cards.
+
+### Settings (`/dashboard/settings`)
+**Container-free page** — linear button navigation, no containers.
+
+| # | Element | Section |
+|---|---------|---------|
+| 1 | Server Name input + save | Account Info |
+| 2 | What's New button | Account Info |
+| 3 | Copy DID button | Account Info |
+| 4 | Copy Onion Address button | Account Info |
+| 5 | Change Password button | Account → opens modal |
+| 6 | Enable 2FA / Disable 2FA button | Account |
+| 7 | Logout button | Account |
+| 8 | Language selector buttons | Interface Mode |
+| 9 | Login with Claude button | Claude Auth |
+| 10 | Enable All / toggle per-category | AI Data Access |
+| 11 | Manage Updates button | System Updates |
+| 12 | Webhook URL input | Webhooks |
+| 13 | Secret input | Webhooks |
+| 14 | Container Crash / Update Available toggles | Webhooks |
+| 15 | Disk Space Warning / Backup Complete toggles | Webhooks |
+| 16 | Save Configuration / Send Test buttons | Webhooks |
+| 17 | Enable Beta Telemetry button | Telemetry |
+| 18 | Create Backup button | Backup |
+| 19 | Export Channel Backup button | Backup |
+| 20 | Network Diagnostics button | Danger Zone |
+| 21 | Reboot button | Danger Zone → confirms with modal |
+| 22 | Factory Reset button | Danger Zone → confirms with modal |
+
+### Monitoring (`/dashboard/monitoring`)
+Container grid. Stats/chart cards.
+
+---
+
+## Focus Memory
+
+| Key | Remembers | Used When |
+|-----|-----------|-----------|
+| `sidebar` | Last sidebar item | Returning to sidebar via Escape/Left |
+| `main` | Last focused container | Re-entering main zone |
+| `navBar` | Last focused tab/button | Up from container returns to same tab |
+
+All focus memory is cleared on route change.
+
+## Data Attributes
+
+| Attribute | Purpose |
+|-----------|---------|
+| `data-controller-zone="main"` | Main content area (on `<main>`) |
+| `data-controller-zone="sidebar"` | Sidebar navigation |
+| `data-controller-container` | Focusable card/tile (with `tabindex="0"`) |
+| `data-controller-install` | Container has an Install button (Enter prioritizes it) |
+| `data-controller-launch` | Container has a Launch button (Enter prioritizes it) |
+| `data-controller-install-btn` | The actual Install button inside a container |
+| `data-controller-launch-btn` | The actual Launch button inside a container |
+| `data-controller-ignore` | Skip this element and descendants from navigation |
+| `data-controller-focus` | Make non-standard element focusable |
+
+## Implementation
+
+- **File**: `neode-ui/src/composables/useControllerNav.ts`
+- **Store**: `neode-ui/src/stores/controller.ts` (tracks active state + gamepad count)
+- **Sounds**: `neode-ui/src/composables/useNavSounds.ts` (move/action/back)
+- **Spatial nav**: `findNearestInDirection()` — filters by direction, scores by overlap + distance
--- a/docs/SEED-VERIFICATION.md
+++ b/docs/SEED-VERIFICATION.md
@@ -0,0 +1,443 @@
+# Archipelago Seed Verification
+
+Independently verify that your 24-word BIP-39 mnemonic produces the correct
+Nostr keys and DIDs, using only standard cryptographic primitives and
+no Archipelago code.
+
+```
+24-word mnemonic
+      |
+      v
+PBKDF2-HMAC-SHA512 (2048 rounds, salt = "mnemonic")
+      |
+      v
+64-byte master seed
+      |
+      +-- HKDF-SHA256 (info="archipelago/node/ed25519/v1")
+      |       --> Node Ed25519 keypair --> did:key:z...
+      |
+      +-- HKDF-SHA256 (info="archipelago/nostr-node/secp256k1/v1")
+      |       --> Node Nostr key --> npub1...
+      |
+      +-- HKDF-SHA256 (info="archipelago/identity/{i}/ed25519/v1")
+      |       --> Identity[i] Ed25519 --> did:key:z...
+      |
+      +-- BIP-32 m/44'/1237'/0'/0/{i}  (NIP-06)
+      |       --> Identity[i] Nostr key --> npub1...
+      |
+      +-- BIP-32 m/84'/0'/0'
+      |       --> Bitcoin HD wallet
+      |
+      +-- HKDF-SHA256 (info="archipelago/lnd/entropy/v1")
+              --> 16 bytes LND aezeed entropy
+```
+
+Source: [`core/archipelago/src/seed.rs`](../core/archipelago/src/seed.rs) and
+[`core/archipelago/src/identity.rs`](../core/archipelago/src/identity.rs)
+
+---
+
+## Setup
+
+```bash
+pip3 install cryptography ecdsa
+```
+
+Two packages, both pure crypto, no network calls. Python 3.9+.
+
+---
+
+## The Verification Script
+
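+Save as `verify-seed.py` and run it with your mnemonic in the `MNEMONIC`
+environment variable. The inline form below is the shortest, but it records the
+mnemonic in your shell history; the `read -rs` variant under "How to Run" avoids that: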
+
+```bash
+MNEMONIC="word1 word2 ... word24" python3 verify-seed.py
+```
+
+```python
+#!/usr/bin/env python3
+"""
+Archipelago seed derivation verifier.
+
+Re-derives every key from a BIP-39 mnemonic using the exact same algorithms
+as the Rust backend (seed.rs), so you can compare outputs independently.
+
+Dependencies: cryptography, ecdsa  (pip3 install cryptography ecdsa)
+No network calls. No file writes. Safe to run air-gapped.
+"""
+
+import hashlib, hmac, os, sys
+
+# ── BIP-39: mnemonic --> 64-byte master seed ─────────────────────────────
+
+def mnemonic_to_seed(mnemonic: str) -> bytes:
+    """PBKDF2-HMAC-SHA512, 2048 rounds, salt = 'mnemonic', no passphrase."""
+    return hashlib.pbkdf2_hmac(
+        "sha512",
+        mnemonic.encode("utf-8"),
+        b"mnemonic",  # BIP-39 salt prefix + empty passphrase
+        2048,
+    )
+
+# ── HKDF-SHA256 (RFC 5869) ──────────────────────────────────────────────
+
+def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32) -> bytes:
+    """
+    HKDF-Extract(salt=None, ikm) then HKDF-Expand(PRK, info, L).
+    Salt=None means 32 zero bytes per RFC 5869 section 2.2.
+    Matches: hkdf::Hkdf::<Sha256>::new(None, ikm).expand(info, &mut okm)
+    """
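+    # Extract
+    prk = hmac.new(b"\x00" * 32, ikm, hashlib.sha256).digest()
+    # Expand: only the first block T(1) is computed, so at most 32 bytes are available
+    if not 0 < length <= 32:
+        raise ValueError("single-block HKDF-SHA256 supports 1..32 output bytes")
+    t1 = hmac.new(prk, info + b"\x01", hashlib.sha256).digest()
+    return t1[:length]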
+
+# ── Ed25519 ──────────────────────────────────────────────────────────────
+
+from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
+from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat
+
+def ed25519_keypair(secret_32: bytes) -> tuple[bytes, bytes]:
+    """Returns (private_32, public_32) from a 32-byte seed."""
+    sk = Ed25519PrivateKey.from_private_bytes(secret_32)
+    pk = sk.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
+    return secret_32, pk
+
+# ── secp256k1 ────────────────────────────────────────────────────────────
+
+from ecdsa import SECP256k1, SigningKey as ECDSASigningKey
+
+def secp256k1_xonly(secret_32: bytes) -> bytes:
+    """32-byte x-only pubkey (Schnorr/Nostr format) from private key bytes."""
+    sk = ECDSASigningKey.from_string(secret_32, curve=SECP256k1)
+    point = sk.get_verifying_key().pubkey.point
+    return point.x().to_bytes(32, "big")
+
+# ── BIP-32 HD derivation (secp256k1) ────────────────────────────────────
+
+import struct
+
+SECP256K1_N = SECP256k1.order
+
+def _bip32_master(seed: bytes) -> tuple[bytes, bytes]:
+    """BIP-32 master key: HMAC-SHA512(key='Bitcoin seed', data=seed)."""
+    I = hmac.new(b"Bitcoin seed", seed, hashlib.sha512).digest()
+    return I[:32], I[32:]  # (secret, chain_code)
+
+def _bip32_ckd(key: bytes, chain: bytes, index: int) -> tuple[bytes, bytes]:
+    """Child key derivation (private -> private)."""
+    if index >= 0x80000000:
+        data = b"\x00" + key + struct.pack(">I", index)
+    else:
+        # Compressed pubkey for non-hardened
+        sk = ECDSASigningKey.from_string(key, curve=SECP256k1)
+        pt = sk.get_verifying_key().pubkey.point
+        prefix = b"\x02" if pt.y() % 2 == 0 else b"\x03"
+        data = prefix + pt.x().to_bytes(32, "big") + struct.pack(">I", index)
+
+    I = hmac.new(chain, data, hashlib.sha512).digest()
+    child = (int.from_bytes(I[:32], "big") + int.from_bytes(key, "big")) % SECP256K1_N
+    return child.to_bytes(32, "big"), I[32:]
+
+def bip32_derive(seed: bytes, path: str) -> bytes:
+    """
+    Derive private key for a BIP-32 path like 'm/44h/1237h/0h/0/0'.
+    Matches: bitcoin::bip32::Xpriv::new_master + derive_priv
+    """
+    key, chain = _bip32_master(seed)
+    # removeprefix (Python 3.9+) drops the literal "m/"; lstrip would strip a character set
+    for part in path.removeprefix("m/").split("/"):
+        hardened = part.endswith("'") or part.endswith("h")
+        idx = int(part.rstrip("'h"))
+        if hardened:
+            idx += 0x80000000
+        key, chain = _bip32_ckd(key, chain, idx)
+    return key
+
+# ── Bech32 encoding (NIP-19: npub / nsec) ───────────────────────────────
+
+_BECH32 = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
+
+def _bech32_polymod(values):
+    GEN = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
+    chk = 1
+    for v in values:
+        b = chk >> 25
+        chk = ((chk & 0x1FFFFFF) << 5) ^ v
+        for i in range(5):
+            chk ^= GEN[i] if ((b >> i) & 1) else 0
+    return chk
+
+def bech32_encode(hrp: str, data: bytes) -> str:
+    """Bech32 encode (NIP-19 for npub1.../nsec1...)."""
+    # Convert 8-bit to 5-bit
+    acc, bits, vals = 0, 0, []
+    for byte in data:
+        acc = (acc << 8) | byte
+        bits += 8
+        while bits >= 5:
+            bits -= 5
+            vals.append((acc >> bits) & 31)
+    if bits:
+        vals.append((acc << (5 - bits)) & 31)
+    # Checksum
+    hrp_exp = [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]
+    polymod = _bech32_polymod(hrp_exp + vals + [0]*6) ^ 1
+    checksum = [(polymod >> 5*(5-i)) & 31 for i in range(6)]
+    return hrp + "1" + "".join(_BECH32[d] for d in vals + checksum)
+
+# ── did:key encoding ────────────────────────────────────────────────────
+
+_B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
+
+def base58_encode(data: bytes) -> str:
+    n = int.from_bytes(data, "big")
+    result = ""
+    while n > 0:
+        n, r = divmod(n, 58)
+        result = _B58[r] + result
+    for b in data:
+        if b == 0:
+            result = "1" + result
+        else:
+            break
+    return result
+
+def to_did_key(ed25519_pub_32: bytes) -> str:
+    """did:key:z<base58btc(0xed01 + pubkey)>  — W3C did:key method, Ed25519."""
+    return "did:key:z" + base58_encode(b"\xed\x01" + ed25519_pub_32)
+
+# ── Main ─────────────────────────────────────────────────────────────────
+
+def main():
+    mnemonic = os.environ.get("MNEMONIC", "").strip()
+    if not mnemonic:
+        print("Enter your 24-word mnemonic (space-separated):")
+        mnemonic = input("> ").strip()
+
+    words = mnemonic.split()
+    if len(words) != 24:
+        print(f"Error: expected 24 words, got {len(words)}", file=sys.stderr)
+        sys.exit(1)
+
+    # Re-join with single spaces so stray whitespace cannot change the derived seed
+    seed = mnemonic_to_seed(" ".join(words))
+
+    W = 72
+    print()
+    print("=" * W)
+    print("  ARCHIPELAGO SEED DERIVATION VERIFICATION")
+    print("=" * W)
+    print()
+    print(f"  Seed fingerprint (SHA-256):  {hashlib.sha256(seed).hexdigest()[:16]}...")
+    print(f"  Seed length:                 {len(seed)} bytes")
+
+    # ── 1. Node Ed25519 + DID ────────────────────────────────────────────
+
+    print()
+    print("-" * W)
+    print("  1. NODE ED25519 KEY")
+    print(f"     HKDF-SHA256(seed, info='archipelago/node/ed25519/v1')")
+    print("-" * W)
+
+    node_ed_priv, node_ed_pub = ed25519_keypair(
+        hkdf_sha256(seed, b"archipelago/node/ed25519/v1")
+    )
+    node_did = to_did_key(node_ed_pub)
+
+    print(f"  Private:  {node_ed_priv.hex()}")
+    print(f"  Public:   {node_ed_pub.hex()}")
+    print(f"  did:key:  {node_did}")
+
+    # ── 2. Node Nostr key ────────────────────────────────────────────────
+
+    print()
+    print("-" * W)
+    print("  2. NODE NOSTR KEY")
+    print(f"     HKDF-SHA256(seed, info='archipelago/nostr-node/secp256k1/v1')")
+    print("-" * W)
+
+    node_nostr_priv = hkdf_sha256(seed, b"archipelago/nostr-node/secp256k1/v1")
+    node_nostr_pub = secp256k1_xonly(node_nostr_priv)
+
+    print(f"  Private:  {node_nostr_priv.hex()}")
+    print(f"  X-only:   {node_nostr_pub.hex()}")
+    print(f"  nsec:     {bech32_encode('nsec', node_nostr_priv)}")
+    print(f"  npub:     {bech32_encode('npub', node_nostr_pub)}")
+
+    # ── 3. Identity[0..2] Ed25519 + DID ─────────────────────────────────
+
+    print()
+    print("-" * W)
+    print("  3. IDENTITY ED25519 KEYS + DID")
+    print(f"     HKDF-SHA256(seed, info='archipelago/identity/{{i}}/ed25519/v1')")
+    print("-" * W)
+
+    for i in range(3):
+        info = f"archipelago/identity/{i}/ed25519/v1".encode()
+        priv, pub = ed25519_keypair(hkdf_sha256(seed, info))
+        did = to_did_key(pub)
+        print(f"  [{i}] Public:   {pub.hex()}")
+        print(f"      did:key:  {did}")
+
+    # ── 4. Identity[0..2] Nostr (NIP-06 BIP-32) ────────────────────────
+
+    print()
+    print("-" * W)
+    print("  4. IDENTITY NOSTR KEYS (NIP-06)")
+    print(f"     BIP-32  m/44'/1237'/0'/0/{{i}}")
+    print("-" * W)
+
+    for i in range(3):
+        priv = bip32_derive(seed, f"m/44'/1237'/0'/0/{i}")
+        pub = secp256k1_xonly(priv)
+        print(f"  [{i}] X-only:   {pub.hex()}")
+        print(f"      nsec:     {bech32_encode('nsec', priv)}")
+        print(f"      npub:     {bech32_encode('npub', pub)}")
+
+    # ── 5. Bitcoin BIP-84 ───────────────────────────────────────────────
+
+    print()
+    print("-" * W)
+    print("  5. BITCOIN WALLET (BIP-84)")
+    print(f"     BIP-32  m/84'/0'/0'")
+    print("-" * W)
+
+    btc_acct = bip32_derive(seed, "m/84'/0'/0'")
+    btc_pub = secp256k1_xonly(btc_acct)
+    print(f"  Account key:  {btc_acct.hex()}")
+    print(f"  Account pub:  {btc_pub.hex()}")
+
+    # ── 6. LND Entropy ──────────────────────────────────────────────────
+
+    print()
+    print("-" * W)
+    print("  6. LND AEZEED ENTROPY")
+    print(f"     HKDF-SHA256(seed, info='archipelago/lnd/entropy/v1')  [16 bytes]")
+    print("-" * W)
+
+    lnd = hkdf_sha256(seed, b"archipelago/lnd/entropy/v1", 16)
+    print(f"  Entropy:  {lnd.hex()}")
+
+    # ── Done ─────────────────────────────────────────────────────────────
+
+    print()
+    print("=" * W)
+    print("  Compare these values with your Archipelago node:")
+    print("    UI:  Settings > Identity")
+    print("    SSH: xxd -p /var/lib/archipelago/identity/node_key.pub")
+    print("    RPC: curl -s http://<ip>/api/rpc \\")
+    print("           -d '{\"method\":\"identity.get-node\"}' | jq .")
+    print("=" * W)
+    print()
+
+if __name__ == "__main__":
+    main()
+```
+
+---
+
+## How to Run
+
+```bash
+# Install (two packages, pure crypto, no telemetry)
+pip3 install cryptography ecdsa
+
+# Option A: environment variable (doesn't persist in shell history)
+read -rs MNEMONIC && export MNEMONIC
+# (type or paste your 24 words, press Enter)
+python3 verify-seed.py
+unset MNEMONIC
+
+# Option B: interactive prompt
+python3 verify-seed.py
+# Enter your 24-word mnemonic (space-separated):
+# > abandon abandon ... art
+```
+
+---
+
+## What to Compare
+
+| Output field | Where to find on your node |
+|---|---|
+| Node Ed25519 public | `xxd -p /var/lib/archipelago/identity/node_key.pub` |
+| Node did:key | Settings > Identity > Node DID |
+| Node npub | Settings > Identity > Nostr Public Key |
+| Identity[0] did:key | Settings > Identity > first identity DID |
+| Identity[0] npub | Settings > Identity > first identity Nostr key |
+
+RPC alternative (from any machine on the LAN; replace `192.168.1.228` with your node's IP):
+
+```bash
+# Node identity
+curl -s http://192.168.1.228/api/rpc \
+  -H 'Content-Type: application/json' \
+  -d '{"method":"identity.get-node"}' | jq .
+
+# All identities
+curl -s http://192.168.1.228/api/rpc \
+  -H 'Content-Type: application/json' \
+  -d '{"method":"identity.list"}' | jq .
+```
+
+---
+
+## Cryptographic Reference
+
+### HKDF-SHA256 (RFC 5869)
+
+Used for Ed25519 keys, the node-level Nostr key, and the LND aezeed entropy.
+Domain separation via unique `info` strings prevents key reuse across contexts.
+
+```
+Extract:  PRK = HMAC-SHA256(salt=0x00*32, ikm=64_byte_seed)
+Expand:   OKM = HMAC-SHA256(PRK, info || 0x01)  [first 32 bytes]
+```
+
+The Rust backend uses `hkdf::Hkdf::<Sha256>::new(None, ikm)` where `None` salt = 32 zero bytes.
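
As a rough illustration of the domain separation described above, the following standard-library sketch derives three outputs from one seed. The multi-block `hkdf_sha256` here is written for this example and is not taken from the Archipelago source; `demo_seed` is an all-zero placeholder, not a real seed.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF-SHA256 with an absent salt (treated as 32 zero bytes)."""
    if not 0 < length <= 255 * 32:
        raise ValueError("length out of range for HKDF-SHA256")
    # Extract: PRK = HMAC(salt, IKM)
    prk = hmac.new(b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: T(n) = HMAC(PRK, T(n-1) || info || n), concatenated and truncated
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Placeholder seed for illustration only (assumption: any 64-byte value works here).
demo_seed = bytes(64)

node_key = hkdf_sha256(demo_seed, b"archipelago/node/ed25519/v1")
nostr_key = hkdf_sha256(demo_seed, b"archipelago/nostr-node/secp256k1/v1")
lnd_entropy = hkdf_sha256(demo_seed, b"archipelago/lnd/entropy/v1", 16)

print("node  :", node_key.hex())
print("nostr :", nostr_key.hex())
print("lnd   :", lnd_entropy.hex())
print("distinct outputs:", len({node_key, nostr_key}) == 2)
```

Changing a single character of `info` yields an unrelated output, while a 16-byte request is simply the first 16 bytes of the 32-byte output for the same `info`.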
+
+### BIP-32 (secp256k1 HD derivation)
+
+Used for per-identity Nostr keys (NIP-06) and Bitcoin wallet.
+
+```
+Master:   HMAC-SHA512(key="Bitcoin seed", data=64_byte_seed)
+Child:    HMAC-SHA512(key=chain_code, data=0x00||key||index)  [hardened]
+          HMAC-SHA512(key=chain_code, data=pubkey||index)     [normal]
+```
+
+The Rust backend uses the `bitcoin` crate: `Xpriv::new_master()` + `derive_priv()`.
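
To make the hardened/normal distinction concrete, here is a small sketch that turns a path string into the raw 32-bit child indices fed into the HMAC above. `parse_bip32_path` and `HARDENED_OFFSET` are names chosen for this example, not identifiers from the Rust backend.

```python
HARDENED_OFFSET = 0x80000000  # 2**31: indices at or above this are hardened


def parse_bip32_path(path: str) -> list[int]:
    """Convert a path such as "m/44'/1237'/0'/0/0" into raw child indices."""
    parts = path.split("/")
    if parts[0] != "m":
        raise ValueError("path must start with 'm'")
    indices = []
    for part in parts[1:]:
        hardened = part.endswith(("'", "h"))
        number = int(part[:-1] if hardened else part)
        if not 0 <= number < HARDENED_OFFSET:
            raise ValueError(f"index out of range: {part}")
        indices.append(number + HARDENED_OFFSET if hardened else number)
    return indices


# NIP-06 path for identity 0: three hardened levels, then two normal ones.
for index in parse_bip32_path("m/44'/1237'/0'/0/0"):
    kind = "hardened" if index >= HARDENED_OFFSET else "normal"
    print(f"0x{index:08x}  {kind}")
```

Hardened indices use the private key in the HMAC input, so the first three levels cannot be derived from a public key alone.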
+
+### did:key (W3C)
+
+```
+did:key:z  +  base58btc( 0xED 0x01 || 32_byte_ed25519_pubkey )
+```
+
+Multicodec prefix `0xED 0x01` identifies Ed25519 public keys.
+The Rust backend uses `bs58::encode()` over a 34-byte buffer.
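
Going the other way can help when comparing values: the sketch below recovers the raw Ed25519 public key from a `did:key` string, so it can be checked against a hex public key. It uses only the standard library; `base58_decode` and `did_key_to_ed25519_pub` are helper names invented for this example.

```python
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
ED25519_MULTICODEC = b"\xed\x01"


def base58_decode(text: str) -> bytes:
    """Decode base58btc text; each leading '1' stands for one leading zero byte."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)  # raises ValueError on invalid characters
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * zeros + body


def did_key_to_ed25519_pub(did: str) -> bytes:
    """Return the 32-byte Ed25519 public key embedded in a did:key string."""
    prefix = "did:key:z"  # 'z' is the multibase code for base58btc
    if not did.startswith(prefix):
        raise ValueError("not a base58btc did:key")
    raw = base58_decode(did[len(prefix):])
    if len(raw) != 34 or raw[:2] != ED25519_MULTICODEC:
        raise ValueError("not an Ed25519 did:key")
    return raw[2:]
```

Calling `did_key_to_ed25519_pub(did).hex()` on the DID shown by the node should give the same hex string as the corresponding `Public:` line of the verification script.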
+
+### NIP-19 Bech32 (npub/nsec)
+
+```
+npub1...  =  bech32(hrp="npub", data=32_byte_x_only_pubkey)
+nsec1...  =  bech32(hrp="nsec", data=32_byte_private_key)
+```
+
+X-only pubkey = just the x-coordinate of the secp256k1 point (Schnorr format).
+
+---
+
+## Security
+
+- Run on an air-gapped machine or at minimum a private terminal session
+- The script makes zero network calls and writes zero files
+- After verification, clean up:
+  ```bash
+  rm verify-seed.py
+  unset MNEMONIC
+  history -c  # bash
+  # or: fc -W /dev/null  # zsh
+  ```
+- Never paste your mnemonic into a web tool, online REPL, or shared terminal