# RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Step 9 + .228 dashboard bug fixes complete, Step 10 / chaos matrix next)

**To resume this work, SSH into the ThinkPad and run `opencode` from `~/Projects/archy/`. Or work from the laptop via the SSHFS mount at `~/mnt/archy-thinkpad/`.**

## Where we are

Working through the 11-step plan in [`rust-orchestrator-migration.md`](./rust-orchestrator-migration.md).

- [x] **Step 1** — `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- [x] **Step 2** — `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- [x] **Step 3** — `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- [x] **Step 4** — `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore `._*`)
- [x] **Step 5** — `fc39b04b` BootReconciler with `Arc<Notify>` shutdown, 4 paused-time tests pass
- [x] **Step 6** — `48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
- [x] **Step 7** — `069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
- [x] **Step 8a** — `a0707f4d` retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
- [x] **Step 9** — **Hot-swap on .228 verified.** All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
- [x] **.228 dashboard bugs** — ExtraHost `192.168.1.254` bug (`3ee192ba`) + LND macaroon permission bug (`be960023`). See "Post-Step 9 bug hunt" below.
- [ ] **Step 8b** — Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- [ ] **Step 8c** — Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- [ ] **Step 10** — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
- [ ] **Step 11** — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)

## Post-Step 9 bug hunt (.228, 2026-04-23)

User reported three visible dashboard bugs after Step 9 verification:

1. LND — "no connect details or QR"
2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
3. bitcoin-core — in scope for chaos testing

**Root cause #1 (ExtraHost, commit `3ee192ba`)**: `scripts/first-boot-containers.sh` computed `HOST_GATEWAY` from `ip route show default`, which returns the **LAN router** (e.g. 192.168.1.254), not the gateway to the host. Every container configured with `--add-host=host.containers.internal:$HOST_GATEWAY` was dialing the WiFi router instead of the host. LND crash-looped with `dial tcp 192.168.1.254:8332: connection refused`; ElectrumX's DAEMON_URL hit the same dead end; any `archy-net` bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic `host-gateway` literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected `--add-host`; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).

**Root cause #2 (macaroon permissions, commit `be960023`)**: LND's `admin.macaroon` lives at `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon`, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (`getinfo`, `connect-info`, `export-channel-backup`) plus the shared `lnd_client()` helper failed with "Failed to read LND admin macaroon". **Confirmed pre-existing on .116 too** (long-standing bug unrelated to Step 9). Fix: centralised the path as `LND_ADMIN_MACAROON_PATH`, added a `read_lnd_admin_macaroon()` helper in `api/rpc/lnd/mod.rs` that tries direct read first then falls back to `sudo -n cat` (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — `curl -k https://<host>/lnd-connect-info` now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.

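The direct-read-then-sudo fallback described for root cause #2 could look roughly like the sketch below. It is not the code from `api/rpc/lnd/mod.rs`: the function name `read_with_sudo_fallback` is hypothetical, and limiting the fallback to permission errors is an assumption (the real helper may fall back on any read failure).

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Read a file directly; on a permission error, retry through
/// non-interactive sudo (`sudo -n cat <path>`).
/// Hypothetical helper, not the project's `read_lnd_admin_macaroon()`.
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        // Assumption: only EACCES triggers the fallback.
        Err(err) if err.kind() == io::ErrorKind::PermissionDenied => {
            // `-n` makes sudo fail instead of prompting for a password.
            let output = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!(
                        "sudo -n cat {} failed: {}",
                        path.display(),
                        String::from_utf8_lossy(&output.stderr).trim()
                    ),
                ))
            }
        }
        // Missing file, I/O error, etc. are returned unchanged.
        Err(err) => Err(err),
    }
}
```

Routing every call site through one function like this keeps the fallback policy in a single place, which is the point of the `LND_ADMIN_MACAROON_PATH` centralisation described above.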
## Step 9 evidence (.228, 2026-04-23)

- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- DEV_MODE override disabled (`override.conf` → `override.conf.disabled-pre-step9`).
- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]` — adoption of 13h-running container worked without recreation
  - `🔄 Boot reconciler started (interval: 30s)` — every 30s, all three app_ids reach `NoOp` after the initial install pass
  - `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18` — pre-start hook fires in `install_fresh`
  - `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
  - OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)

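The `nginx.conf rendered` log line comes from a render step that, per the Step 7 design answer, writes the file via tmp + rename. A minimal sketch of that write pattern, using only the standard library, might look like this. The function name `write_atomically` and the `.tmp` suffix are assumptions, not taken from the repo.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to `target` so readers never observe a half-written file.
/// Hypothetical helper; the suffix and name are illustrative only.
fn write_atomically(target: &Path, contents: &str) -> io::Result<()> {
    // Keep the temp file next to the target: rename is only atomic
    // within a single filesystem.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp)?;
    file.write_all(contents.as_bytes())?;
    // Flush data to disk before the rename makes it visible.
    file.sync_all()?;
    drop(file);

    // On Linux this replaces any existing target in one step.
    fs::rename(&tmp, target)
}
```

Because the bind-mounted path only ever changes through a rename, a container reading the file sees either the old or the new config in full, which matters when the render runs on every reconcile pass.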
## Bugs fixed this session

1. **`parse_memory_limit` truncation bug** (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` → "128i" → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. 6 regression tests; `create_container` now omits the limit instead of emitting 0.
2. **`archipelago.service` cgroup delegation missing** (`ba83f9bc`): added `Delegate=memory pids cpu io` to the unit as a belt-and-braces measure.
3. **ExtraHost `192.168.1.254`** (`3ee192ba`): see Post-Step 9 bug hunt above.
4. **LND admin.macaroon unreadable** (`be960023`): see Post-Step 9 bug hunt above.

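For bug 1, a suffix-aware parser that returns `None` on bad input (so the caller can omit the limit) could be sketched as follows. This is an illustration, not the fix in `732df1b8`: the name `parse_memory_quantity`, the case-sensitive matching, and the set of decimal suffixes are all assumptions.

```rust
/// Parse a memory quantity such as "128Mi" into bytes.
/// Returns `None` for anything unparseable, so the caller can leave the
/// limit out rather than emit 0. Hypothetical sketch, not the repo's code.
fn parse_memory_quantity(input: &str) -> Option<u64> {
    let s = input.trim();
    // IEC binary suffixes plus (as an assumption) SI decimal ones.
    const UNITS: [(&str, u64); 6] = [
        ("Ki", 1 << 10),
        ("Mi", 1 << 20),
        ("Gi", 1 << 30),
        ("K", 1_000),
        ("M", 1_000_000),
        ("G", 1_000_000_000),
    ];
    for &(suffix, factor) in UNITS.iter() {
        // Strip the whole suffix at once instead of trimming single chars.
        if let Some(number) = s.strip_suffix(suffix) {
            return number.trim().parse::<u64>().ok()?.checked_mul(factor);
        }
    }
    // No suffix: treat the value as a plain byte count.
    s.parse::<u64>().ok()
}
```

Stripping the full suffix in one step avoids the intermediate "128i" state from the original bug, and `checked_mul` turns an overflow into `None` rather than a wrapped value.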
## Commits made this session

```
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer} (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
```

Branch is **19 commits ahead of tx1138/main** (local only — user pushes to mirrors personally).

## Uncommitted state

Clean. Only untracked: `tests/` (bats harness from prior session, not in scope), `tmp-dump-spec.py` (scratch).

## Answered design questions (no need to re-ask)

1. UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
3. Reconciler interval → 30 seconds
4. Concurrency → per-app `Mutex<()>` in a `DashMap`
5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
6. Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
7. Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.

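Answer 4 (one lock per app) can be illustrated with the standard library alone. The sketch below deliberately swaps `DashMap` for a `Mutex<HashMap<..>>` so it compiles without external crates. The type `AppLocks` and method `lock_for` are hypothetical names, not the orchestrator's API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Registry handing out one mutex per app_id.
/// Std-only stand-in for the `DashMap`-based design; names are illustrative.
#[derive(Default)]
struct AppLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl AppLocks {
    /// Return the lock for `app_id`, creating it on first use.
    fn lock_for(&self, app_id: &str) -> Arc<Mutex<()>> {
        // The outer lock is held only long enough to look up or insert.
        let mut map = self.locks.lock().unwrap();
        map.entry(app_id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}
```

An install or reconcile for `bitcoin-ui` would hold `lock_for("bitcoin-ui")` for its whole duration, so two operations on the same app serialise while operations on different apps proceed in parallel.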
## Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| `archy` | 192.168.1.116 | **Dev ThinkPad** (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| `archy228` | 192.168.1.228 | Kiosk HP ProDesk. **Step 9 landing zone** — now running Rust-orchestrator binary in prod mode. | password123 | archipelago |

Both are development alpha nodes — **full destructive latitude**, no need to ask before stop/start/rebuild.

## Next action

**Step 10 — Hot-swap on .116.**

Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.

Steps:

1. Disable DEV_MODE on .116 (check whether `override.conf` exists in `/etc/systemd/system/archipelago.service.d/`)
2. Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago` → `/usr/local/bin/archipelago.new`
3. Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
4. Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
5. Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
6. `systemctl stop archipelago` → back up the current binary as `/usr/local/bin/archipelago.bak` → install the new binary → `systemctl start archipelago`
7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
8. If broken → restore `.bak` binary, re-enable DEV_MODE override.
9. Commit STATUS.md update.

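The "no container was recreated" check in step 7 can be made mechanical by diffing the step 5 snapshot against a second one taken after the restart. A rough sketch, assuming the three tab-separated columns from the `--format` string above and treating `CreatedAt` as an opaque string (the function names are hypothetical):

```rust
use std::collections::BTreeMap;

/// Parse `name<TAB>status<TAB>created_at` lines into name -> created_at.
/// Lines with fewer than three columns are skipped.
fn parse_snapshot(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?.trim();
            let _status = cols.next()?; // status is expected to change
            let created = cols.next()?.trim();
            if name.is_empty() {
                None
            } else {
                Some((name.to_string(), created.to_string()))
            }
        })
        .collect()
}

/// Names present before the swap that are missing afterwards or whose
/// creation timestamp changed (i.e. the container was recreated).
fn recreated_or_missing(before: &str, after: &str) -> Vec<String> {
    let after = parse_snapshot(after);
    parse_snapshot(before)
        .into_iter()
        .filter(|(name, created)| after.get(name) != Some(created))
        .map(|(name, _)| name)
        .collect()
}
```

An empty result means every pre-existing container kept its original creation time, which is what a clean adoption should produce. The status column is ignored on purpose, since uptime text differs between the two snapshots.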
**Risk on .116:** If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is `install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago`.

**After Step 10 we are blocked on Step 8b** (multi-day manifest ports) before Step 11 (chaos matrix).

---

### Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

- `first-boot-containers.sh` creates **30+ containers** with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

---

# Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in [`bulletproof-containers.md`](./bulletproof-containers.md).

---

## Current state

### Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |

Fleet test nodes:

| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | unreachable today | |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |

### Known open issues (drives the plan below)

1. **UI companion containers disappear** on .228 after daemon restarts — no auto-recreate (fix planned in v1.7.45 Quadlet migration)
2. **bitcoin.conf rpcauth drifts** from canonical secret → ElectrumX "Daemon connection problem" (fix planned in v1.7.43 reconcile::derived)
3. **`host.containers.internal`** resolves to LAN gateway inside containers on some versions (fix planned in v1.7.42 containers.conf)
4. **Podman state DB loss** requires manual recovery (fix planned in v1.7.44 startup self-heal)
5. **LND "Connect Wallet" info** vanishing after crashes — symptom of the same drift class as #2
6. **ElectrumX not syncing** on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled

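For issue 3, the two `--add-host` shapes this document mentions (podman's `host-gateway` literal and a pinned bridge address such as `10.89.0.1`) can be built from one small helper. This is a sketch only; `HostTarget` and `add_host_arg` are hypothetical names and not part of the orchestrator.

```rust
use std::net::IpAddr;

/// Where a container-side hostname should point.
/// Illustrative type, not taken from the repo.
enum HostTarget {
    /// Let podman resolve the host's address via its `host-gateway` literal.
    HostGateway,
    /// Pin an explicit address, e.g. a fixed bridge IP.
    Fixed(IpAddr),
}

/// Build a single `--add-host=<alias>:<target>` argument for `podman run`.
fn add_host_arg(alias: &str, target: &HostTarget) -> String {
    match target {
        HostTarget::HostGateway => format!("--add-host={}:host-gateway", alias),
        HostTarget::Fixed(ip) => format!("--add-host={}:{}", alias, ip),
    }
}
```

Neither variant derives the address from the default route, which is the property that matters here: the default route points at the LAN router rather than at the host.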
### Recent field incident (2026-04-22)

- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in [`bulletproof-containers.md`](./bulletproof-containers.md).

---

## Plan

We're shipping a level-triggered **reconciler + Quadlet** architecture over six incremental releases. Each release closes one failure mode. See [`bulletproof-containers.md`](./bulletproof-containers.md) for the full design, code layout, test harness, chaos matrix, sources.

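"Level-triggered" means each pass decides what to do from the currently observed state alone, not from a history of events. A toy sketch of one such pass follows. Only the `NoOp` outcome is named elsewhere in this document; `Observed`, `Install`, `Start`, `plan`, and `reconcile_pass` are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// What the runtime reports for one app right now (hypothetical states).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Observed {
    Missing,
    Stopped,
    Running,
}

/// What the reconciler decides to do (only `NoOp` is named in the docs).
#[derive(Debug, PartialEq)]
enum Action {
    Install,
    Start,
    NoOp,
}

/// Pure decision: depends only on the observed state.
fn plan(observed: Observed) -> Action {
    match observed {
        Observed::Missing => Action::Install,
        Observed::Stopped => Action::Start,
        Observed::Running => Action::NoOp,
    }
}

/// One pass over every desired app; apps absent from `observed` count as missing.
fn reconcile_pass(
    desired: &[&str],
    observed: &BTreeMap<String, Observed>,
) -> Vec<(String, Action)> {
    desired
        .iter()
        .map(|app_id| {
            let state = observed.get(*app_id).copied().unwrap_or(Observed::Missing);
            (app_id.to_string(), plan(state))
        })
        .collect()
}
```

Because the decision is a pure function of state, running the pass repeatedly is safe: once every app is running, every pass yields `NoOp`, and a container that vanishes between passes is simply picked up as `Missing` on the next one.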
### Release roadmap

| Release | Closes | What lands | Status |
|---|---|---|---|
| **v1.7.41** | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | **in flight — deploying to .228 for test** |
| **v1.7.42** | FM4 (`host.containers.internal` wrong) | `/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1` | pending |
| **v1.7.43** | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
| **v1.7.44** | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber` | pending |
| **v1.7.45** | FM1 + FM3 (companion orphans) | `archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it | pending |
| **v1.7.46** | — | `archy-lnd-ui` → Quadlet | pending |
| **v1.7.47** | — | `archy-electrs-ui` → Quadlet | pending |
| **v1.7.48+** | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |

Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.

---

## Release history

### [v1.7.41-alpha](/releases/v1.7.41-alpha/) — IN FLIGHT — 2026-04-22
**Post-OTA auto-rollback.** After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:

- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot — probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server

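The "healthy within 90 seconds, otherwise roll back" decision reduces to polling a probe until a deadline. The sketch below shows only that timing logic and leaves the probe abstract (in the release it would be an HTTPS request to `https://127.0.0.1/`). The function name `healthy_within` and the idea of a fixed polling interval are assumptions, not details from `update.rs`.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll `probe` until it succeeds or `timeout` elapses.
/// Returns `true` on the first success, `false` if the deadline passes,
/// in which case the caller would trigger the rollback path.
fn healthy_within<F: FnMut() -> bool>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        // Give up if the next attempt would start after the deadline.
        if Instant::now() + interval > deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

A caller might use something like `healthy_within(probe_web_ui, Duration::from_secs(90), Duration::from_secs(5))`, where `probe_web_ui` and the 5-second interval are placeholders. Retrying inside the window avoids rolling back a release just because nginx needed a few seconds to come up.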
### [v1.7.40-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/) — 2026-04-22
**Proper fix for the 500 error.** Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly `chmod 755` before tar; `--mode=u=rwX,go=rX` normalizes archive perms; pre-ship assertion aborts release if `tar tvzf | head -1` isn't `drwxr-xr-x`.

Changes:

- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)

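The pre-ship assertion lives in a shell script, but its logic is small enough to restate as a pure function for illustration. The sketch below assumes it is handed the text output of a verbose tar listing. `root_dir_mode_ok` is a hypothetical name and this is not how `create-release-manifest.sh` is implemented.

```rust
/// Check that the first entry of a verbose tar listing is a directory with
/// mode 755 (`drwxr-xr-x`), i.e. readable and traversable by nginx.
/// An empty listing fails the check.
fn root_dir_mode_ok(tar_listing: &str) -> bool {
    tar_listing
        .lines()
        .next()
        .map_or(false, |first| first.starts_with("drwxr-xr-x"))
}
```

A release script would abort when this returns `false`, which is the case the v1.7.38/39 tarballs (root dir `drwx------`) would have hit.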
### [v1.7.39-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.39-alpha/) — 2026-04-22
**Hotfix attempt** for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in `main.rs` and post-extract chmod in `update.rs` OTA applier.

### [v1.7.38-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.38-alpha/) — 2026-04-22
**Onboarding auto-heal + silent logins + App Store trim.**

Changes:

- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state — backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()` — silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

### [v1.7.37-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.37-alpha/) — 2026-04-22
**Bitcoin Core install fixes + dynamic node UI + full-archive default.**

- Bitcoin Core passes explicit `-rpcbind/-rpcallowip/etc.` CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode — full archive default

---

## Key docs

- [`bulletproof-containers.md`](./bulletproof-containers.md) — full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- [`BETA-RELEASE-CHECKLIST.md`](./BETA-RELEASE-CHECKLIST.md) — existing beta checklist
- [`BETA-ISSUES-20260328.md`](./BETA-ISSUES-20260328.md) — prior beta-blocker tracking
- [`hotfix-process.md`](./hotfix-process.md) — release workflow
- [`architecture.md`](./architecture.md) — system architecture overview

---

## How to resume

1. Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
2. Read [`bulletproof-containers.md`](./bulletproof-containers.md) for the current plan
3. Check task list (`/list` or via Claude Code) for the in-flight release
4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified