RESUME HERE — Rust orchestrator migration
Updated: 2026-04-23 (Step 9 + .228 dashboard bug fixes complete, Step 10 / chaos matrix next)
To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.
Where we are
Working through the 11-step plan in rust-orchestrator-migration.md.
- Step 1: `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- Step 2: `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- Step 3: `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- Step 4: `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore `._*`)
- Step 5: `fc39b04b` BootReconciler with Arc shutdown, 4 paused-time tests pass
- Step 6: `48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
- Step 7: `069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 `container::` tests pass
- Step 8a: `a0707f4d` retire `archipelago-reconcile.{service,timer}` + ISO builder touchpoints, keep scripts for update.rs
- Step 9: Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
- .228 dashboard bugs: ExtraHost `192.168.1.254` bug (3ee192ba) + LND macaroon permission bug (be960023). See "Post-Step 9 bug hunt" below.
- Step 8b: Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- Step 8c: Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- Step 10: Hot-swap + verify on .116 (adoption-heavy test: .116 already has all containers running)
- Step 11: Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)
Post-Step 9 bug hunt (.228, 2026-04-23)
User reported three visible dashboard bugs after Step 9 verification:
- LND — "no connect details or QR"
- ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
- bitcoin-core — in scope for chaos testing
Root cause #1 (ExtraHost, commit 3ee192ba): scripts/first-boot-containers.sh computed HOST_GATEWAY from ip route show default, which returns the LAN router (e.g. 192.168.1.254), not the gateway to the host. Every container configured with --add-host=host.containers.internal:$HOST_GATEWAY was dialing the WiFi router instead of the host. LND crash-looped with dial tcp 192.168.1.254:8332: connection refused; ElectrumX's DAEMON_URL hit the same dead end; any archy-net bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic host-gateway literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected --add-host; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
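A minimal sketch of the corrected flag, written in Rust purely for illustration (the actual fix landed in the shell script). The helper name `add_host_arg` is hypothetical; the only thing taken from the notes above is that the literal `host-gateway` replaces the computed IP.

```rust
/// Hypothetical helper: builds the `--add-host` flag for a container.
/// `host-gateway` is the special value podman resolves to the host itself,
/// so nothing has to be derived from the default route.
fn add_host_arg(hostname: &str) -> String {
    format!("--add-host={hostname}:host-gateway")
}
```

Because the value is a constant, the flag no longer depends on which network the node happens to be attached to at first boot.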
Root cause #2 (macaroon permissions, commit be960023): LND's admin.macaroon lives at /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (getinfo, connect-info, export-channel-backup) plus the shared lnd_client() helper failed with "Failed to read LND admin macaroon". Confirmed pre-existing on .116 too (long-standing bug unrelated to Step 9). Fix: centralised the path as LND_ADMIN_MACAROON_PATH, added a read_lnd_admin_macaroon() helper in api/rpc/lnd/mod.rs that tries direct read first then falls back to sudo -n cat (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — curl -k https://<host>/lnd-connect-info now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
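The direct-read-then-sudo pattern described above could look roughly like the sketch below. It is not the code from `api/rpc/lnd/mod.rs`: the function name `read_with_sudo_fallback` and its signature are assumptions, and it presumes a passwordless sudo rule for `cat` exists on the node.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Reads a file directly; if that fails (for example with a permission
/// error), retries through `sudo -n cat <path>`.
/// Illustrative stand-in for `read_lnd_admin_macaroon()`, not its real body.
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(direct_err) => {
            // `-n` makes sudo fail immediately instead of prompting.
            let output = Command::new("sudo")
                .arg("-n")
                .arg("cat")
                .arg(path)
                .output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                // Surface the original error; it names the real problem.
                Err(direct_err)
            }
        }
    }
}
```

Trying the direct read first keeps the common case free of a subprocess and means the helper keeps working if file ownership is fixed later.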
Step 9 evidence (.228, 2026-04-23)
- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (be960023); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- DEV_MODE override disabled (`override.conf` → `override.conf.disabled-pre-step9`).
- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]`: adoption of 13h-running container worked without recreation
  - `🔄 Boot reconciler started (interval: 30s)`: every 30s, all three app_ids reach `NoOp` after the initial install pass
  - `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18`: pre-start hook fires in `install_fresh`
  - `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
  - OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)
Bugs fixed this session
- `parse_memory_limit` truncation bug (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` → "128i" → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. 6 regression tests; `create_container` now omits instead of emitting 0.
- `archipelago.service` cgroup delegation missing (`ba83f9bc`): belt-and-braces `Delegate=memory pids cpu io`.
- ExtraHost `192.168.1.254` (`3ee192ba`): see Post-Step 9 bug hunt above.
- LND admin.macaroon unreadable (`be960023`): see Post-Step 9 bug hunt above.
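For reference, a suffix-aware parser in the spirit of the `parse_memory_limit` fix might look like this. It is a simplified sketch, not the code from `732df1b8`: the name `parse_memory_limit_sketch` is hypothetical, it only handles whole numbers with `Ki`/`Mi`/`Gi` or no suffix, and it assumes the caller omits the limit when it gets `None`.

```rust
/// Parses a quantity such as "128Mi" into bytes.
/// Returns `None` for anything it does not understand, so the caller can
/// leave the limit out instead of emitting 0.
fn parse_memory_limit_sketch(input: &str) -> Option<u64> {
    let s = input.trim();
    // Match the full two-character IEC suffix rather than trimming
    // single characters, which is what went wrong in the original bug.
    let (digits, multiplier) = if let Some(n) = s.strip_suffix("Ki") {
        (n, 1u64 << 10)
    } else if let Some(n) = s.strip_suffix("Mi") {
        (n, 1u64 << 20)
    } else if let Some(n) = s.strip_suffix("Gi") {
        (n, 1u64 << 30)
    } else {
        (s, 1)
    };
    let value: u64 = digits.parse().ok()?;
    // Overflow yields None rather than a wrapped value.
    value.checked_mul(multiplier)
}
```

With this shape, `"128Mi"` maps to 134217728 bytes and a malformed string maps to `None`, never to a silent zero.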
Commits made this session
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer} (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
Branch is 19 commits ahead of tx1138/main (local only — user pushes to mirrors personally).
Uncommitted state
Clean. Only untracked: tests/ (bats harness from prior session, not in scope), tmp-dump-spec.py (scratch).
Answered design questions (no need to re-ask)
- UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
- BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
- Reconciler interval → 30 seconds
- Concurrency → per-app `Mutex<()>` in a `DashMap`
- Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
- Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
- Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
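The "atomic tmp+rename" step from the Step 7 answer can be sketched as follows. The function name `write_atomic` and the `.tmp` suffix are assumptions, not the names used in the repo; the sketch also assumes the temp file and the destination live on the same filesystem, which is what makes the rename atomic on POSIX systems.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `contents` to `dest` by writing a sibling temp file, flushing it
/// to disk, and renaming it over the destination. A reader (here, nginx in
/// the container) sees either the old file or the new one, never a
/// half-written file.
fn write_atomic(dest: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme: "<dest>.tmp" next to the destination.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?;
    fs::rename(&tmp, dest)
}
```

Rendering on every reconcile pass stays safe with this approach, since repeated writes simply replace the file in one step.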
Context: which host is what
| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| archy | `192.168.1.116` | Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| archy228 | `192.168.1.228` | Kiosk HP ProDesk. Step 9 landing zone, now running Rust-orchestrator binary in prod mode. | password123 | archipelago |
Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.
Next action
Step 10 — Hot-swap on .116.
Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.
Steps:
- Disable DEV_MODE on .116 (check if override.conf exists in `/etc/systemd/system/archipelago.service.d/`)
- Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago` → `/usr/local/bin/archipelago.new`
- Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
- Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
- Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
- `systemctl stop archipelago` → install binary → `systemctl start archipelago`
- Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
- If broken → restore `.bak` binary, re-enable DEV_MODE override.
- Commit STATUS.md update.
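One way to check the "no container was recreated" condition is to diff the pre-swap snapshot against a second snapshot taken after the restart. The sketch below assumes both snapshots are tab-separated `name`, `status`, `created_at` lines as produced by the format string above; the function names are hypothetical and no such tool exists in the repo as far as these notes say.

```rust
use std::collections::HashMap;

/// Parses `name<TAB>status<TAB>created_at` lines into name -> created_at.
/// Lines with fewer than three columns are skipped.
fn parse_snapshot(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?.trim();
            let _status = cols.next()?;
            let created = cols.next()?.trim();
            if name.is_empty() {
                return None;
            }
            Some((name.to_string(), created.to_string()))
        })
        .collect()
}

/// Returns the sorted names of containers that existed before the swap but
/// are missing afterwards or carry a different creation timestamp.
fn recreated_or_missing(before: &str, after: &str) -> Vec<String> {
    let after = parse_snapshot(after);
    let mut changed: Vec<String> = parse_snapshot(before)
        .into_iter()
        .filter(|(name, created)| after.get(name) != Some(created))
        .map(|(name, _)| name)
        .collect();
    changed.sort();
    changed
}
```

An empty result means every container kept its original creation time, which is the adoption outcome Step 10 is looking for. The status column is ignored on purpose, since uptime text changes between snapshots even when nothing was recreated.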
Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.
After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).
Why Step 8 got split (discovered 2026-04-23)
Original plan was one commit "delete bash + edit ISO builder". But on investigation:
- `first-boot-containers.sh` creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.
Archipelago — Current State, Plan, and Releases
Updated: 2026-04-22
This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.
Current state
Fleet status
All four Gitea mirrors are synced to v1.7.40-alpha:
| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via podman system renumber — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |
Fleet test nodes:
| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via systemd-run chmod 755 /opt/archipelago/web-ui after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | unreachable today | |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |
Known open issues (drives the plan below)
1. UI companion containers disappear on .228 after daemon restarts; no auto-recreate (to be fixed by v1.7.45 Quadlet migration)
2. bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (to be fixed by v1.7.43 reconcile::derived)
3. `host.containers.internal` resolves to LAN gateway inside containers on some versions (to be fixed by v1.7.42 containers.conf)
4. Podman state DB loss requires manual recovery (to be fixed by v1.7.44 startup self-heal)
5. LND "Connect Wallet" info vanishing after crashes: symptom of the same drift class as #2
6. ElectrumX not syncing on .228: downstream of #2; will resolve when bitcoin.conf is reconciled
Recent field incident (2026-04-22)
- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in `bulletproof-containers.md`.
Plan
We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.
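As a rough illustration of what "level-triggered" means here, the decision on each pass depends only on the state observed at that moment. Everything in the sketch is an assumption except the `NoOp` name, which appears elsewhere in these notes; the real design is in `bulletproof-containers.md`.

```rust
/// What the reconciler observes for one app on a given pass.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Observed {
    Missing,
    Stopped,
    Running,
}

/// What the reconciler does about it. Only `NoOp` is a name taken from
/// these notes; `Start` and `Install` are hypothetical.
#[derive(Debug, PartialEq)]
enum Action {
    NoOp,
    Start,
    Install,
}

/// Level-triggered: the decision is a pure function of the current state,
/// not of whichever event (crash, reboot, OTA) produced it.
fn decide(observed: Observed) -> Action {
    match observed {
        Observed::Running => Action::NoOp,
        Observed::Stopped => Action::Start,
        Observed::Missing => Action::Install,
    }
}
```

Because no event history is needed, a missed or duplicated event cannot leave a container permanently absent; the next pass simply observes the gap and closes it.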
Release roadmap
| Release | Closes | What lands | Status |
|---|---|---|---|
| v1.7.41 | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes https://127.0.0.1/ on boot; if non-200 within 90s, restores web-ui.bak + calls rollback_update() + restarts | in flight (deploying to .228 for test) |
| v1.7.42 | FM4 (host.containers.internal wrong) | /etc/containers/containers.conf w/ host_containers_internal_ip = 10.89.0.1; every container gets --add-host=host.archipelago:10.89.0.1 | pending |
| v1.7.43 | FM2 (config drift) | reconcile::derived::render_bitcoin_conf: pure fn over canonical secret, rewrites on drift. Same for lnd.conf | pending |
| v1.7.44 | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via /run/user/$UID/* clear + system renumber | pending |
| v1.7.45 | FM1 + FM3 (companion orphans) | archy-bitcoin-ui → Quadlet .container unit in /etc/containers/systemd/. systemd (not archipelago) owns it | pending |
| v1.7.46 | - | archy-lnd-ui → Quadlet | pending |
| v1.7.47 | - | archy-electrs-ui → Quadlet | pending |
| v1.7.48+ | all (full daemon refactor) | core/archipelago/src/reconcile/ module replaces imperative install.rs container management. Main app containers become Quadlet too | pending |
Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
Release history
v1.7.41-alpha — IN FLIGHT — 2026-04-22
Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.
Changes:
- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot: probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
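The 90-second verification window amounts to polling a health probe until it passes or a deadline expires. The sketch below shows only that loop: `healthy_within` and its parameters are hypothetical, the probe is passed in as a closure so no HTTP client is assumed, and the real logic lives in `verify_pending_update()`, which this page does not reproduce.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Calls `probe` repeatedly until it returns true or `timeout` elapses.
/// Returns true if the probe succeeded in time; on false, the caller
/// would trigger the rollback path.
fn healthy_within(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> bool,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

A caller would pass `Duration::from_secs(90)` as the timeout and a closure that issues the request to `https://127.0.0.1/` and checks for a 200 response.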
v1.7.40-alpha — 2026-04-22
Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.
Changes:
- `scripts/create-release-manifest.sh`: pre-tar chmod + `tar --mode` flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
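The pre-ship assertion is a one-line check on the first entry of the archive listing. The real check is in the shell script; the Rust sketch below only restates the rule, and the function name `root_dir_is_755` is hypothetical.

```rust
/// Returns true when the first line of a `tar tvzf` listing describes a
/// directory with mode 755, i.e. starts with `drwxr-xr-x`.
fn root_dir_is_755(listing: &str) -> bool {
    listing
        .lines()
        .next()
        .map_or(false, |first| first.starts_with("drwxr-xr-x"))
}
```

A listing whose first line starts with `drwx------` (the v1.7.38/39 failure) is rejected, as is an empty listing.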
v1.7.39-alpha — 2026-04-22
Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.
v1.7.38-alpha — 2026-04-22
Onboarding auto-heal + silent logins + App Store trim.
Changes:
- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state; backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()`; silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`
(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)
v1.7.37-alpha — 2026-04-22
Bitcoin Core install fixes + dynamic node UI + full-archive default.
- Bitcoin Core passes explicit `-rpcbind`/`-rpcallowip`/etc. CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode; full archive is now the default
Key docs
- `bulletproof-containers.md`: full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- `BETA-RELEASE-CHECKLIST.md`: existing beta checklist
- `BETA-ISSUES-20260328.md`: prior beta-blocker tracking
- `hotfix-process.md`: release workflow
- `architecture.md`: system architecture overview
How to resume
- Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
- Read `bulletproof-containers.md` for the current plan
- Check task list (`/list` or via Claude Code) for the in-flight release
- Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified