archy/docs/STATUS.md
archipelago f9fef8d2cc docs(status): record rounds 3-5 + config migration + changelog as shipped
Adds a new top section to STATUS.md covering v1.7.43-alpha:

- Round 3: phase-based install progress bar
- Round 4: post-install scanner kick for instant Launch button
- Round 5: .23 VPS retirement, .168 promoted to Server 1
- Config migration: auto-purge .23 from saved registry/mirror JSONs
- Changelog: new v1.7.43-alpha entry in AccountInfoSection

All 5 commits, deployment md5, verification notes, and git remote
cleanup captured. Round 2 rollback command still valid for the full
stack since backups predate every round in this session.
2026-04-23 09:09:02 -04:00


RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Install UX polish: phase-based progress bar, post-install scanner kick for instant Launch button, .23 VPS retired with auto-purge migration, frontend/backend deployed to .228 as v1.7.43-alpha.)

To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.


INSTALL UX POLISH + .23 RETIREMENT — SHIPPED (v1.7.43-alpha)

Rounds 3-5 + config migration + changelog (2026-04-23) - 5 commits on main (unpushed per user mirror protocol):

  • 8cc84ebc feat(install): phase-based progress bar replaces unparseable pull bytes - podman pull emits zero parseable progress when stderr is piped (no TTY), so the legacy byte-counting regex never matched. Replaced with 7 phase-based levels: Preparing (5%) → PullingImage (20%) → CreatingContainer (70%) → StartingContainer (80%) → WaitingHealthy (88%) → PostInstall (95%) → Done (100%). UI maps phases to fixed % and only advances forward (Math.max). Final phase label renamed from "Running post-install…" to "Finalizing…" after user feedback that it read like a regression to the install step.
  • f86d86c3 fix(install): kick scanner post-install so Launch button appears immediately — scan runs every 60s; post-install the state flipped to Running but the skeletal install-time manifest (interfaces: None) persisted until next scan, so canLaunch(pkg) returned false for up to a minute. Added scan_kick: Arc<Notify> + scan_tick: Arc<watch::Sender<u64>> on RpcHandler. Scan loop uses tokio::select! between the 60s interval and the notify. New kick_scanner_and_wait helper (2s timeout) called in install/update success paths BEFORE writing Running, so a fresh manifest lands first. Merge during Installing/Updating uses merge_preserving_transitional (keeps state, takes fresh manifest).
  • 22052325 chore: retire .23 VPS mirror, promote .168 OVH to primary — dropped DEFAULT_TERTIARY_MIRROR_URL, promoted .168 to DEFAULT_SECONDARY_MIRROR_URL as "Server 1 (OVH)". 2-entry default registry (.168 priority 0, tx1138 priority 10). Trusted-registry allowlist, catalog fallback, installer ISO registries, marketplaceData.ts REGISTRY, image-versions.sh all updated. Tests updated for new default counts (registry 3→2, mirror 3→2). URL-parser fixture tests in update.rs retain .23 strings intentionally — they exercise string-parsing logic, not policy.
  • 0ee16820 fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs - load_mirrors/load_registries normally only ADD missing defaults (explicit removals stick, by design). Existing nodes have .23 baked into their saved update-mirrors.json + config/registries.json and would pay timeouts forever against a dead host. Added targeted one-time migration in both loaders: .retain(|m| !m.url.contains("23.182.128.160")) before the defaults-merge step. Narrow-scope exception to the stickiness rule, documented in-code. Triggers lazily on next load (install RPC, update RPC, Settings UI open).
  • 008da477 docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement — 4 release-note bullets in AccountInfoSection.vue describing async-spawn, phase progress, scanner kick, and .23 retirement from the operator's perspective. Historical "Server 3 (OVH)" entries in older changelog blocks left intact — they describe what shipped at the time.
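
The phase-to-percent mapping from 8cc84ebc can be sketched as below. This is a minimal illustration, not the shipped code: the phase names and percentages come from the commit note, while the type name InstallPhase and the helper advance are hypothetical. The notes say the forward-only clamp lives in the UI (Math.max); it is written in Rust here only to show the logic.

```rust
/// Install phases in the order listed in the commit note.
/// The enum name `InstallPhase` is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallPhase {
    Preparing,
    PullingImage,
    CreatingContainer,
    StartingContainer,
    WaitingHealthy,
    PostInstall,
    Done,
}

impl InstallPhase {
    /// Fixed percentage displayed for each phase.
    pub fn percent(self) -> u8 {
        match self {
            InstallPhase::Preparing => 5,
            InstallPhase::PullingImage => 20,
            InstallPhase::CreatingContainer => 70,
            InstallPhase::StartingContainer => 80,
            InstallPhase::WaitingHealthy => 88,
            InstallPhase::PostInstall => 95,
            InstallPhase::Done => 100,
        }
    }
}

/// Forward-only update: a late or out-of-order phase event
/// never moves the bar backwards (hypothetical helper).
pub fn advance(current_percent: u8, phase: InstallPhase) -> u8 {
    current_percent.max(phase.percent())
}
```

With this shape, a stale PullingImage event arriving after CreatingContainer leaves the bar at 70%.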

Deployed to .228:

  • Backend binary md5 d2b619949f19815faaeab10429e36ba0 at /usr/local/bin/archipelago.
  • Frontend at /opt/archipelago/web-ui/ (includes marketplaceData.ts .168 update + v1.7.43-alpha changelog entry). Deployed bundle verified: .168 present in Settings-*.js + Marketplace-*.js, .23 absent from all assets.
  • /var/lib/archipelago/update-mirrors.json + config/registries.json were manually deleted + regenerated with new defaults during Round 5 verification; migration code will handle any other node on first load.
  • Rollback targets from Round 2 still valid: /usr/local/bin/archipelago.bak-pre-async-install + /opt/archipelago/web-ui.bak-pre-async-install/.
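
The one-time purge from 0ee16820 that other nodes will run on first load can be sketched as follows. Only the retain predicate and the "purge before defaults-merge" ordering come from the notes; the Mirror struct, its priority field type, and the function name migrate_and_merge are assumptions for illustration. Tracking of explicit removals (the stickiness rule) is not modelled.

```rust
/// Hypothetical stand-in for a saved mirror/registry entry.
#[derive(Debug, Clone, PartialEq)]
pub struct Mirror {
    pub url: String,
    pub priority: u32,
}

/// Address of the decommissioned VPS, as quoted in commit 0ee16820.
const RETIRED_HOST: &str = "23.182.128.160";

/// Purge retired entries first, then append any default that is missing.
/// Sketch only: the real loaders also honour explicit user removals.
pub fn migrate_and_merge(mut saved: Vec<Mirror>, defaults: &[Mirror]) -> Vec<Mirror> {
    // One-time migration step: drop anything pointing at the dead host.
    saved.retain(|m| !m.url.contains(RETIRED_HOST));
    // Defaults-merge step: add defaults that are not already present.
    for d in defaults {
        if !saved.iter().any(|m| m.url == d.url) {
            saved.push(d.clone());
        }
    }
    saved
}
```

Running the purge before the merge means a saved file that contained only the retired host still ends up with the current defaults.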

Git remotes cleaned on .116 (working-copy change only, not in any commit):

  • git remote remove gitea-vps (dropped the .23 Gitea remote).
  • git remote set-url --delete --push origin http://.../23.182.128.160:3000/... (dropped .23 from origin multi-push alias).
  • Remaining push targets: tx1138 (canonical), gitea-local (localhost Gitea), gitea-vps2 (.168 OVH).

Rollback Rounds 3-5 (same command as Round 2; backups predate all of this):

ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'

ASYNC-SPAWN LIFECYCLE FIX — SHIPPED (Stop/Start/Restart + Install/Uninstall/Update)

Round 2 (2026-04-23, install/uninstall/update) — 3 commits on main:

  • 2d5b859e feat(rpc): async-spawn install/uninstall/update lifecycle — new api/rpc/package/async_lifecycle.rs with spawn_package_install, spawn_package_uninstall, spawn_package_update. Dispatcher + handler thread self: Arc<Self> so spawned tasks own their Arc. Install/update Ok arms explicitly set Running because merge_preserving_transitional refuses to let the scanner overwrite Installing/Updating. Removed redundant inner "already updating" guard in update.rs. Transient install entry uses empty icon (see commit 3 rationale).
  • 0733ac40 fix(ui): shorten install/uninstall/update timeouts for async RPCs — drop 11m/45m timeouts to 15s across rpc-client.ts, stores/server.ts, and the 5 direct call sites in Marketplace.vue, Discover.vue, MarketplaceAppDetails.vue. Return types updated to { status, package_id }.
  • e471ef75 fix(rpc): empty icon in transient install entry to avoid broken-image flicker - progress.rs::create_installing_entry no longer hardcodes /assets/img/app-icons/<id>.png. About half of bundled apps use .svg/.webp icons; the frontend's fallback chain (backend_icon || curated.icon || placeholder) now lands on the correct curated extension.
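
Why an empty backend icon fixes the flicker: the fallback chain backend_icon || curated.icon || placeholder treats an empty string as "absent". The real chain lives in the frontend; the Rust rendering below is only a sketch of the same logic, and the function name resolve_icon and the example paths are hypothetical.

```rust
/// Sketch of `backend_icon || curated.icon || placeholder`.
/// An empty string counts as "no icon", mirroring JavaScript's falsy `||`.
pub fn resolve_icon<'a>(
    backend_icon: Option<&'a str>,
    curated_icon: Option<&'a str>,
    placeholder: &'a str,
) -> &'a str {
    backend_icon
        .filter(|s| !s.is_empty())
        .or(curated_icon.filter(|s| !s.is_empty()))
        .unwrap_or(placeholder)
}
```

With the transient install entry sending an empty icon, the curated .svg or .webp path wins instead of a hardcoded .png that may not exist.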

Deployed to .228 (binary md5 f66857b3b8b3640c8cac8bd25fe508ec at /usr/local/bin/archipelago, backup at /usr/local/bin/archipelago.bak-pre-async-install; frontend at /opt/archipelago/web-ui/, backup at /opt/archipelago/web-ui.bak-pre-async-install/). User confirmed: uninstall fast and responsive, install of LND + SearXNG clean, icon flicker fixed.

Known out-of-scope issue: Vaultwarden container itself exits immediately on start with an internal error. The async wrapper correctly detects this via post-start exit verification and removes the state entry. Needs separate vaultwarden container-config investigation.

Rollback Round 2 (if ever needed):

ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'

Round 1 (Stop/Start/Restart) — 4 commits on main (unpushed per user mirror protocol):

  • 44cd5eef feat(rpc): spawn_transitional helper for async lifecycle ops — new api/rpc/transitional.rs with Op::{Stop,Start,Restart} and RpcHandler::spawn_transitional / flip_to_transitional / set_state helpers. install_log re-exported so sibling modules can use it.
  • 19a99ca9 fix(rpc): async container stop/start/restart; widen state mapping - container.rs start/stop rewritten + restart added; container-list now emits all transitional variants instead of falling back to "unknown". dispatcher.rs registers container-restart. package/runtime.rs mirrored with do_package_* helpers inside tokio::spawn and revert-on-error.
  • 6712810b fix(state): preserve transitional state across container scans - server.rs scan merge now keeps transitional states while taking fresh observability fields; 1200s stuck-timeout escape hatch via transitional_since: HashMap<String, Instant>. Three passing server::merge_tests.
  • 9ce28f08 fix(ui): single-button lifecycle control with transitional labels - ContainerApps.vue and ContainerAppDetails.vue use a single primary button driven by getAppVisualState(). Dashboard now routes through container-start/container-stop (the async RPCs) instead of the legacy synchronous bundled-app-* path. ContainerStatus.vue widened to render all new variants.
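
The 1200s stuck-timeout escape hatch from 6712810b can be sketched as an in-memory side table. The HashMap<String, Instant> shape and the 1200s value come from the commit note; the struct name TransitionalClock and its methods are hypothetical, and the sketch passes `now` explicitly so the behaviour is easy to check.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stuck-timeout quoted in the commit note.
pub const STUCK_TIMEOUT: Duration = Duration::from_secs(1200);

/// In-memory side table: app id -> moment it entered a transitional state.
/// (Hypothetical wrapper; the notes only name the `transitional_since` map.)
#[derive(Default)]
pub struct TransitionalClock {
    since: HashMap<String, Instant>,
}

impl TransitionalClock {
    /// Record the start of a transition. An existing anchor is kept,
    /// so repeated scans do not reset the clock.
    pub fn mark(&mut self, app_id: &str, now: Instant) {
        self.since.entry(app_id.to_string()).or_insert(now);
    }

    /// Forget the anchor once a final state has been written.
    pub fn clear(&mut self, app_id: &str) {
        self.since.remove(app_id);
    }

    /// True when the scanner may override a transitional state
    /// because it has been held longer than the timeout.
    pub fn is_stuck(&self, app_id: &str, now: Instant) -> bool {
        self.since
            .get(app_id)
            .map_or(false, |t| now.duration_since(*t) > STUCK_TIMEOUT)
    }
}
```

Because the table is not persisted, a backend restart drops every anchor, which matches follow-up 3 below.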

Deployed to .228 (ThinkPad demo device):

  • Binary at /usr/local/bin/archipelago (md5 de86b63f74c7e6fe6e555ffe30b86b4f), backup at /usr/local/bin/archipelago.bak-pre-async-stop.
  • Frontend at /opt/archipelago/web-ui/, backup at /opt/archipelago/web-ui.bak-pre-async-stop/.
  • Release build took 3m56s on .116. Deploy via scp + atomic install -m 755 + systemctl restart archipelago. nginx -t + systemctl reload nginx for frontend.

Manual verification: user clicked Stop on LND in the dashboard. Button flipped to Stopping… instantly, held for the full graceful-stop window, transitioned to Start when podman stop completed. No mid-flight revert to Running. User sign-off: "absolutely beautiful".

Rollback (if ever needed):

ssh archy228 'sudo cp /usr/local/bin/archipelago.bak-pre-async-stop /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-stop/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'

Follow-ups to consider

  1. Chaos matrix / Step 11 — the original next-step gated behind this fix. Now unblocked.
  2. bundled-app-start / bundled-app-stop — still synchronous in the backend. Dashboard no longer calls them, but the RPC methods remain for any external caller. Decide: deprecate, or mirror the async-spawn treatment for parity.
  3. transitional_since persistence — currently in-memory only, so a backend restart mid-stop loses the timeout anchor. Acceptable for now (scan loop re-observes live podman state and reconciles), but worth revisiting if crash-recovery stories tighten.
  4. Test regressions inventory — the full cargo test -p archipelago run on .116 shows 22 pre-existing failures in unrelated modules (mesh/wallet/credentials/avatar/session/transport/update-mirrors/fips/identity_manager/image_versions). Unrelated to this work but tech debt. Log at /tmp/cargo-test-all.log on .116.
  5. Amend STATUS.md's older "NEXT SESSION — START HERE" section (below) — it is now stale. Left in place for historical reference of how the fix was designed; delete on the next pass if it gets confusing.

NEXT SESSION — START HERE (historical — fix above is now shipped)

Goal: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: "best server containers in the world". Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live Stopping… label.

How to work on this repo (SSH + SSHFS setup)

You are likely running on the laptop (macOS). The repo lives on the ThinkPad (.116). There are two access paths, use both in parallel:

  1. SSHFS mount at ~/mnt/archy-thinkpad/ — for all file ops (read/edit/write/glob/grep).
  2. Direct SSH — for everything that isn't file ops: git, cargo, npm, systemctl, running the server, tailing logs.

See the "FUSE / SSHFS development loop" section below for the full mount lifecycle — that's the thing that makes this dev setup work, and it will break periodically.

FUSE / SSHFS development loop

Why this exists: editing the repo directly on the ThinkPad over raw SSH means no IDE, no tool-native file reads, no glob/grep speed. SSHFS mounts the remote filesystem as a local directory so OpenCode's file tools work transparently. But SSHFS is a leaky abstraction — know the gotchas or you'll waste hours.

Stack (macOS laptop):

  • macFUSE — kernel extension providing FUSE on macOS. Install via brew install --cask macfuse (requires reboot + security approval in System Settings the first time).
  • sshfs — userspace mount tool. Install via brew install gromgit/fuse/sshfs-mac (the homebrew core sshfs was removed; use this tap).
  • Verify: which sshfs → /opt/homebrew/bin/sshfs, sshfs --version → SSHFS version 2.10 / FUSE library version 2.9.9.

Actual mount command currently running (verified from ps):

sshfs archy:Projects/archy /Users/dorian/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad

Breakdown:

  • archy:Projects/archy — remote path via the archy SSH alias (uses ~/.ssh/archy_opencode, no password prompt).
  • ~/mnt/archy-thinkpad — local mount point. Create once: mkdir -p ~/mnt/archy-thinkpad.
  • reconnect — sshfs auto-reconnects if the TCP session drops (WiFi flap, laptop sleep). Without this, the mount turns into a zombie immediately.
  • ServerAliveInterval=15 — sends a keepalive every 15s.
  • ServerAliveCountMax=3 — disconnect after 3 missed keepalives (45s). Tune up if your network is flaky.
  • volname=archy-thinkpad — Finder display name.

Check mount health:

mount | grep archy-thinkpad
# should print: archy:Projects/archy on /Users/dorian/mnt/archy-thinkpad (macfuse, nodev, nosuid, synchronous, mounted by dorian)

ls ~/mnt/archy-thinkpad/ | head
# should list repo contents fast (<1s). If it hangs, mount is stale.

Recovery when the mount hangs / goes stale (this WILL happen — laptop sleeps, WiFi drops, ThinkPad reboots):

# 1. Force-unmount (macOS — `umount` alone often fails on a hung FUSE mount)
sudo diskutil unmount force ~/mnt/archy-thinkpad
# fallback if diskutil can't see it:
sudo umount -f ~/mnt/archy-thinkpad

# 2. Kill any zombie sshfs process
pkill -f "sshfs archy:Projects/archy"

# 3. Remount
sshfs archy:Projects/archy ~/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad

# 4. Verify
ls ~/mnt/archy-thinkpad/ | head

If the mount point itself got wedged (ls: /Users/dorian/mnt/archy-thinkpad: Device not configured), the sequence above still works — macFUSE garbage-collects the inode after the force-unmount.

When to use which path (rules, not suggestions):

| Operation | Use | Why |
| --- | --- | --- |
| read / edit / write | SSHFS mount | OpenCode tools want local paths |
| glob / grep | SSHFS mount | Local FS traversal is fine; remote would need rg over SSH |
| Reading many files | SSHFS mount | Each read is a round-trip but parallelizable |
| git status / git diff / git log | SSH | Git over FUSE is painfully slow (lots of stat calls) |
| git add / git commit | SSH | Same: commit times grow linearly with tree size on FUSE |
| cargo check / cargo test / cargo build | SSH | Compiling over FUSE would take hours; cargo's incremental stat pattern destroys FUSE performance |
| npm install / npm run build | SSH | Same reason: massive file churn |
| Running the server / tailing journal | SSH | Service lives on .116 |
| Deploying to .228 | SSH from .116 | SCP from ThinkPad; laptop isn't in the critical path |

Don't do this (will bite you):

  • cargo build from the mount — will try to write target/ over FUSE, gets orders of magnitude slower, may hang.
  • rsync without --exclude="._*" — macOS writes AppleDouble metadata files, they leak to the remote as ._* siblings of every real file. .gitignore already excludes them (commit 13858842), but they clutter the tree.
  • Writing big binary files via the mount — use scp over SSH instead.
  • Relying on file-change-watcher tools (watchman, chokidar) — they get confused by FUSE event semantics.

Editing workflow in a typical session:

  1. Laptop: OpenCode reads a file via /Users/dorian/mnt/archy-thinkpad/.... FUSE fetches it over SSH, caches briefly.
  2. Laptop: OpenCode edits the file — FUSE writes the new bytes back to .116 immediately (synchronous mount).
  3. Laptop: ssh archy "cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago" — runs on the real filesystem on .116, sees the edit.
  4. Laptop: ssh archy "cd ~/Projects/archy && git diff path/to/file" — confirms the edit landed.
  5. Laptop: ssh archy "cd ~/Projects/archy && git add path/to/file && git commit -m '...'" — commit from .116.

The SSHFS mount and the SSH shell are pointing at the same inodes — edits via the mount are instantly visible to cargo/git over SSH. There's no "sync" step.

Cache caveat: macFUSE caches attributes briefly (default ~1s). If you write via SSH and read via the mount within that window, you may see stale metadata. The mount's synchronous flag (visible in mount output) minimizes but doesn't eliminate this. If you get a weird diff between what SSH and the mount report, re-read after a second, or run stat ~/mnt/archy-thinkpad/<file> to force a refresh (the stock macOS stat has no --file-system flag; that option is GNU coreutils only).

Direct SSH access (use when FUSE isn't the right tool):

  • ssh archy → archipelago@192.168.1.116 using ~/.ssh/archy_opencode
  • ssh archy228 → archipelago@192.168.1.228 using ~/.ssh/archy_opencode
  • Full host form also works: ssh archipelago@192.168.1.116 / ssh archipelago@192.168.1.228 (same key resolves via IdentitiesOnly).

SSH keys — what's where

Laptop ~/.ssh/ (macOS, user dorian):

| File | Purpose |
| --- | --- |
| archy_opencode / .pub | Primary key for this project. Unlocks both archy (.116) and archy228 (.228). Created 2026-04-22 specifically for OpenCode work. |
| archipelago-deploy / .pub | Older archipelago deploy key. Not needed for current work. |
| id_ed25519 / .pub | Personal default key. Not used by archy/archy228 configs (IdentitiesOnly yes forces archy_opencode). |
| id_ed25519_angor / .pub | Angor project. Unrelated. |
| id_ed25519_start9 / .pub | Start9 project. Unrelated. |
| vps-ci-setup / .pub | VPS CI. Unrelated. |
| config | Host aliases (shown above) |

.116 /home/archipelago/.ssh/:

| File | Purpose |
| --- | --- |
| authorized_keys | Accepts: laptop's archy_opencode.pub + 3 other keys (4 lines total). |
| id_ed25519 / .pub | .116's OWN identity key. This is what lets .116 → .228 work passwordless. |
| archipelago-deploy | Symlink → id_ed25519 (legacy alias). |
| id_ed25519_vps168 / .pub | For SSH to 146.59.87.168 (VPS). Unrelated to this work. |
| config | Host entry for the VPS only. |

.228 /home/archipelago/.ssh/:

| File | Purpose |
| --- | --- |
| authorized_keys | Accepts: laptop's archy_opencode.pub + .116's id_ed25519.pub + 2 others (4 lines total). |
| (no id_ed25519) | .228 has no outbound key; it's a terminal node. Don't try to ssh from .228 to anywhere. |

Connectivity matrix (all verified 2026-04-23):

| From → To | Works passwordless | Via |
| --- | --- | --- |
| Laptop → .116 | yes | archy_opencode |
| Laptop → .228 | yes | archy_opencode |
| .116 → .228 | yes | .116's id_ed25519 |
| .228 → anywhere | no | no outbound key (by design) |

Sudo — verified state

.116 (dev ThinkPad):

  • User archipelago is in sudo group.
  • Sudo password required: ThisIsWeb54321@
  • Sudoers drop-ins present: /etc/sudoers.d/archipelago-ci, /etc/sudoers.d/archipelago-wg (scope-limited NOPASSWD for specific CI/wg commands — not full NOPASSWD).
  • For most dev work you don't need sudo on .116.

.228 (prod kiosk):

  • User archipelago has full passwordless sudo via /etc/sudoers.d/archipelago containing archipelago ALL=(ALL) NOPASSWD:ALL.
  • User is also in sudo group.
  • Sudo password (if ever prompted, shouldn't be): archipelago
  • Dashboard password: password123

Cargo / npm / paths

  • Cargo PATH gotcha: non-interactive SSH login has no cargo in PATH. Always use ~/.cargo/bin/cargo over SSH.
    • Example: ssh archy '~/.cargo/bin/cargo check -p archipelago --manifest-path ~/Projects/archy/core/Cargo.toml'
    • Or cd first: ssh archy 'cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago'
  • Long cargo builds (>2 min Bash tool timeout): launch detached and poll the log:
    ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
    ssh archy 'tail -30 /tmp/cargo-build.log'
    ssh archy 'pgrep -a cargo'   # to check if still running
    
  • npm / frontend lives at ~/Projects/archy/neode-ui/ on .116 (also accessible via laptop mount at ~/mnt/archy-thinkpad/neode-ui/). Node is on interactive PATH; for scripted SSH, source ~/.nvm/nvm.sh && nvm use or call the absolute path if nvm is used.
  • Repo on .116: ~/Projects/archy/ (Cargo workspace at core/Cargo.toml).
  • Web root on .228: check /etc/nginx/sites-enabled/ for the live path; historically /var/lib/archipelago/web-ui/ or /opt/archipelago/web-ui/.

Deploying new server binary to .228

# 1. Build on .116 (detached — takes ~3-5 min for release)
ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
# wait / tail log until "Finished `release` profile"

# 2. SCP .116 → .228 (uses .116's id_ed25519 → .228's authorized_keys, passwordless)
ssh archy 'scp ~/Projects/archy/core/target/release/archipelago archipelago@192.168.1.228:/tmp/archipelago.new'

# 3. Atomic swap on .228 with backup
ssh archy228 'sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-pre-async-stop && sudo mv /tmp/archipelago.new /usr/local/bin/archipelago && sudo chmod +x /usr/local/bin/archipelago && sudo systemctl restart archipelago'

# 4. Verify
ssh archy228 'systemctl status archipelago --no-pager | head -20 && sudo journalctl -u archipelago -n 50 --no-pager'

Git workflow

  • Branch: main on .116, currently 22 commits ahead of tx1138/main.
  • Remote tx1138 exists but do NOT push — user mirrors to 4 Gitea remotes personally after reviewing.
  • Atomic commits, one logical change per commit. Conventional Commits format (feat:, fix:, docs:, refactor:, chore:, test:, perf:).
  • Never --amend unless the commit you're amending was created in this session AND has not been pushed. Safer: new commit.
  • Never --force push. Never modify git config.
  • If pre-commit hooks fail, create a NEW commit with the fix — don't --amend after a failed commit.

Other

  • Full destructive latitude on both nodes. Announce multi-hour ops (OTA, full rebuild, apt upgrade). Don't ask for routine stop/start/rebuild permission.
  • No ship pressure. Do it properly.
  • Use question tool for ambiguous decisions (don't guess user intent on design choices).
  • Keep docs/STATUS.md fresh between sessions — it IS the session handoff.

Hosts reference (quick)

| Host | IP | SSH alias | Role | Dashboard | Sudo |
| --- | --- | --- | --- | --- | --- |
| archy (ThinkPad X250) | 192.168.1.116 | ssh archy | dev host, Debian 13 | archipelago | ThisIsWeb54321@ |
| archy228 (HP ProDesk) | 192.168.1.228 | ssh archy228 | prod kiosk, Rust orchestrator | password123 | NOPASSWD (fallback archipelago) |

Bug being fixed

Dashboard sequence when user clicks Stop LND:

  1. UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via loadingApps.add('lnd').
  2. Frontend calls container-stop RPC. Server runs podman stop -t 330 lnd synchronously inside the RPC handler (via orchestrator.stop()). RPC blocks up to 5.5 min for LND (330s timeout + overhead).
  3. Meanwhile the 30-second package-scan loop in server.rs:scan_and_update_packages keeps running. It rebuilds PackageDataEntry from podman inspect — podman still reports running (stop hasn't completed) — and blindly overwrites the store entry at server.rs:854.
  4. container-list RPC reads state_manager snapshot → returns state = "running".
  5. Frontend polling sees running → getAppState() returns 'running' → the two-button (Start | Stop) block re-renders → the transitional button disappears → UI looks like the stop silently failed.
  6. Eventually podman stop finishes → next scan → state flips to Stopped → buttons change again.

Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".

Decisions already locked in (do not re-ask)

  • Full scope fix (not minimal hotfix). User chose "Go full scope, do it right".
  • Async-spawn lives in the RPC layer, not in the ContainerOrchestrator trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
  • PackageState already has Stopping/Starting/Restarting/Installing/Updating/Removing variants — enum at core/archipelago/src/data_model.rs:107-124. No schema change needed.
  • UI collapses to one full-width button with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when not-installed).
  • Helper API shape: RpcHandler::spawn_transitional(op: Op, app_id: String) where Op is an enum {Stop, Start, Restart}. Helper dispatches to orchestrator.stop/start/restart internally, knows each op's transitional+final states, handles error → revert + install_log().
  • mark_user_stopped must run BEFORE the spawn (preserves ordering the crash recovery layer depends on — see runtime.rs:145-148).

Implementation order (4 commits, local only)

Commit 1 — feat(rpc): spawn_transitional helper for async lifecycle ops

  • New file: core/archipelago/src/api/rpc/transitional.rs (or extend container.rs; prefer new file for cohesion with future stacks/package variants)
  • enum Op { Stop, Start, Restart } with transitional_state(), final_state_on_success(), log_prefix(), and async dispatch(&orch, &app_id) method
  • impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }
    • Capture Arc<dyn ContainerOrchestrator> + Arc<StateManager> clones
    • Set transitional state via state_manager.update_data() (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
    • tokio::spawn(async move { ... })
    • Inside spawn: install_log("{LOG_PREFIX}: {app_id}"), op.dispatch(&orch, &app_id).await, on success set final state, on error log + install_log("{LOG_PREFIX} FAIL: …") + revert state to previous (cache pre-transition state in a local)
    • Return Ok(()) immediately after spawn
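
The synchronous core of the helper described above (which state each op enters, which state it ends in, and the revert-on-error rule) can be sketched without tokio. This is an assumption-laden outline: the trimmed PackageState enum, the log prefix strings, and the settle function are hypothetical, and the async dispatch plus tokio::spawn wrapper are deliberately left out.

```rust
/// Trimmed copy of the states this sketch needs (the real enum has more variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Stopping,
    Starting,
    Restarting,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Stop,
    Start,
    Restart,
}

impl Op {
    /// State written before the background task is spawned.
    pub fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }

    /// State written when the orchestrator call succeeds.
    pub fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }

    /// Prefix for install_log lines (strings here are placeholders).
    pub fn log_prefix(self) -> &'static str {
        match self {
            Op::Stop => "STOP",
            Op::Start => "START",
            Op::Restart => "RESTART",
        }
    }
}

/// State to write once the op finishes: final on success,
/// the cached pre-transition state on error (hypothetical helper).
pub fn settle(op: Op, previous: PackageState, result: Result<(), String>) -> PackageState {
    match result {
        Ok(()) => op.final_state_on_success(),
        Err(_) => previous,
    }
}
```

Keeping these mappings on Op means the spawned task only needs the op, the app id, and the cached previous state.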

Commit 2 — fix(rpc): async container stop/start/restart; widen state mapping

  • api/rpc/container.rs:85-107 — rewrite handle_container_stop body: validate_app_id, mark_user_stopped, spawn_transitional(Op::Stop, app_id.to_string()).await?, return Ok(json!({ "status": "stopping" }))
  • api/rpc/container.rs:61-83 — rewrite handle_container_start: clear_user_stopped, spawn_transitional(Op::Start, …), return { "status": "starting" }
  • Add handle_container_restart (currently missing in container.rs — only exists as package.restart at runtime.rs:176-242). Register RPC route name container-restart. Add matching frontend client method in container-client.ts.
  • api/rpc/container.rs:148-154 — widen the container-list state mapping: add arms for Stopping → "stopping", Starting → "starting", Restarting → "restarting", Installing → "installing", Updating → "updating", Removing → "removing", Installed → "installed", CreatingBackup/RestoringBackup/BackingUp → their kebab-case strings. No more "unknown" fallback unless the variant is genuinely unknown.
  • Mirror same spawn treatment in api/rpc/package/runtime.rs: handle_package_start (L28-119), handle_package_stop (L122-173), handle_package_restart (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) inside the spawned future, not in the RPC body.
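
The widened container-list mapping can be sketched as an exhaustive match. The variant names come from the notes; the function name state_str and the exact variant order are assumptions, and the real enum at data_model.rs may carry additional variants not listed here.

```rust
/// Variants named in these notes; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Installing,
    Installed,
    Starting,
    Running,
    Stopping,
    Stopped,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Kebab-case string emitted by `container-list` (hypothetical helper name).
/// No wildcard arm: adding a variant without a mapping fails to compile.
pub fn state_str(state: PackageState) -> &'static str {
    match state {
        PackageState::Installing => "installing",
        PackageState::Installed => "installed",
        PackageState::Starting => "starting",
        PackageState::Running => "running",
        PackageState::Stopping => "stopping",
        PackageState::Stopped => "stopped",
        PackageState::Restarting => "restarting",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```

Omitting the catch-all arm is a design choice in this sketch: it turns a silently emitted "unknown" into a compile error when the enum grows.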

Commit 3 — fix(state): preserve transitional state across container scans

  • server.rs:847-857 — in the merge loop, before the merged.insert(id.clone(), pkg.clone()) overwrite, check merged.get(id).state and skip overwrite if it's transitional: matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)
  • Still allow non-state fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep existing.state but merge updated fields from pkg. Write a tiny helper merge_preserving_transitional(existing, fresh) -> PackageDataEntry.
  • Unit test: construct existing.state = Stopping, fresh.state = Running, assert merged.state stays Stopping.
  • Also check: Is there a timeout escape hatch? If Stopping is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck Stopping forever. Mitigation: track a transitional_since: Instant in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
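
A minimal sketch of the merge helper and the unit-test scenario described above, assuming a trimmed PackageDataEntry with only state, lan_address, and health (the real struct has more fields). The function name and the list of transitional variants follow the notes; everything else is illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Installing,
    Stopping,
    Starting,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

impl PackageState {
    /// Transitional states the scanner must not overwrite.
    pub fn is_transitional(self) -> bool {
        use PackageState::*;
        matches!(
            self,
            Installing | Stopping | Starting | Restarting | Updating
                | Removing | CreatingBackup | RestoringBackup | BackingUp
        )
    }
}

/// Trimmed stand-in for the real entry type.
#[derive(Debug, Clone, PartialEq)]
pub struct PackageDataEntry {
    pub state: PackageState,
    pub lan_address: Option<String>,
    pub health: Option<String>,
}

/// Keep the existing state while it is transitional, but take every
/// other field from the fresh scan result.
pub fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: PackageDataEntry,
) -> PackageDataEntry {
    if existing.state.is_transitional() {
        PackageDataEntry { state: existing.state, ..fresh }
    } else {
        fresh
    }
}
```

Merging an existing Stopping entry with a fresh Running scan yields Stopping plus the fresh lan_address and health, which is the assertion the unit test calls for.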

Commit 4 — fix(ui): single-button lifecycle control with transitional labels

  • neode-ui/src/api/container-client.ts — extend ContainerStatus.state union to: 'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'. Add restartContainer(appId) method calling container-restart.
  • neode-ui/src/stores/container.ts - add computed getAppVisualState(appId) that returns one of: 'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'. Maps exited → stopped, created → stopped, paused → stopped, installed → stopped. Add restartContainer(appId) action (sets loadingApps for request dedup, calls client, does NOT fetchContainers immediately because server will broadcast state; a final fetchContainers after a short delay can backstop if WebSocket push is absent).
  • neode-ui/src/views/ContainerApps.vue:85-136 — replace the two-button conditional with a single full-width button bound to getAppVisualState(app.id). Table:
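    | visual state | click action | label | spinner | disabled |
    | --- | --- | --- | --- | --- |
    | not-installed | installApp | Install | no | no |
    | running | stopContainer | Stop | no | no |
    | stopped | startContainer | Start | no | no |
    | starting | (none) | Starting… | yes | yes |
    | stopping | (none) | Stopping… | yes | yes |
    | restarting | (none) | Restarting… | yes | yes |
    | installing | (none) | Installing… | yes | yes |
    | updating | (none) | Updating… | yes | yes |
    | removing | (none) | Removing… | yes | yes |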
    • Add a separate Restart button next to the primary one when state is running, calling new restartContainer action. Restart button hides while transitional.
  • neode-ui/src/views/ContainerAppDetails.vue:83 (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
  • Also audit line 239 of ContainerApps.vue (some((app) => store.getAppState(app.id) === 'created')) and the logic around lines 276, 295, 309, 312 — make sure they use getAppVisualState where appropriate.
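
The state collapse and button table for Commit 4 are language-neutral, so they are sketched here in Rust to stay consistent with the backend; the real implementation belongs in the Vue store and views. The names VisualState, visual_state, Button, and button are hypothetical, as is the choice to render "unknown" as stopped.

```rust
/// Visual states listed for the single-button control.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VisualState {
    NotInstalled,
    Running,
    Stopped,
    Starting,
    Stopping,
    Restarting,
    Installing,
    Updating,
    Removing,
}

/// Collapse a raw `container-list` state into a visual state.
/// `None` stands for "no container entry" (assumed to mean not installed).
pub fn visual_state(raw: Option<&str>) -> VisualState {
    match raw {
        None => VisualState::NotInstalled,
        Some("running") => VisualState::Running,
        Some("starting") => VisualState::Starting,
        Some("stopping") => VisualState::Stopping,
        Some("restarting") => VisualState::Restarting,
        Some("installing") => VisualState::Installing,
        Some("updating") => VisualState::Updating,
        Some("removing") => VisualState::Removing,
        // exited, created, paused and installed render as stopped;
        // treating "unknown" the same way is an assumption of this sketch.
        Some(_) => VisualState::Stopped,
    }
}

/// Label plus spinner/disabled flags for the primary button.
#[derive(Debug, PartialEq, Eq)]
pub struct Button {
    pub label: &'static str,
    pub spinner: bool,
    pub disabled: bool,
}

pub fn button(state: VisualState) -> Button {
    // `busy` is true for every transitional state in the table.
    let (label, busy) = match state {
        VisualState::NotInstalled => ("Install", false),
        VisualState::Running => ("Stop", false),
        VisualState::Stopped => ("Start", false),
        VisualState::Starting => ("Starting…", true),
        VisualState::Stopping => ("Stopping…", true),
        VisualState::Restarting => ("Restarting…", true),
        VisualState::Installing => ("Installing…", true),
        VisualState::Updating => ("Updating…", true),
        VisualState::Removing => ("Removing…", true),
    };
    Button { label, spinner: busy, disabled: busy }
}
```

Deriving both flags from one busy value keeps the spinner and disabled columns from drifting apart.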

Verification gates (do not skip)

  1. ~/.cargo/bin/cargo check -p archipelago on .116 via SSH
  2. ~/.cargo/bin/cargo test -p archipelago on .116 via SSH — at least the new merge helper test must pass
  3. Build release binary on .116: nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown. Poll until done.
  4. SCP binary to .228 /usr/local/bin/archipelago, back up prior to /usr/local/bin/archipelago.bak-pre-async-stop. sudo systemctl restart archipelago on .228.
  5. Manual LND stop test on .228:
    • Open dashboard, confirm LND is Running (first: ssh archipelago@192.168.1.228 'podman start lnd' — LND is currently Exited(0) from the demo)
    • Click Stop
    • Expected: button immediately becomes "Stopping…" with spinner (RPC returns <1s)
    • Dashboard should stay on "Stopping…" for ~5 min
    • Then flip to a single button labeled "Start"
    • At no point should it revert to "Running" mid-stop
  6. Same test with Bitcoin Core stop (longest timeout, 600s)
  7. Frontend build: cd ~/Projects/archy/neode-ui && npm run type-check && npm run build. Rsync dist/ to archipelago@192.168.1.228:/var/lib/archipelago/web-ui/ (or wherever the active web root is — check /etc/nginx on .228 first).
  8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
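Gate 2 and the "never revert to Running mid-stop" expectation in gate 5 both depend on the merge helper. A minimal sketch of the rule it has to enforce follows, assuming a simplified package entry. The `PackageState` variants and the `PackageEntry` struct here are stand-ins for illustration, not the definitions in data_model.rs.

```rust
/// Simplified stand-in for the real PackageState enum (variants assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Installing,
    Updating,
    Starting,
    Stopping,
    Restarting,
    Removing,
    Running,
    Stopped,
}

impl PackageState {
    /// Transitional states are owned by an in-flight RPC, not by the scanner.
    pub fn is_transitional(self) -> bool {
        !matches!(self, PackageState::Running | PackageState::Stopped)
    }
}

/// Hypothetical entry: a state plus an opaque manifest payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PackageEntry {
    pub state: PackageState,
    pub manifest: String,
}

/// Keep the stored state while it is transitional, but always take the
/// freshly scanned manifest. Settled entries are replaced wholesale.
pub fn merge_preserving_transitional(existing: &PackageEntry, scanned: PackageEntry) -> PackageEntry {
    if existing.state.is_transitional() {
        PackageEntry { state: existing.state, manifest: scanned.manifest }
    } else {
        scanned
    }
}
```

With this rule a scan that still sees the container as running cannot overwrite "Stopping" in the store; only the stop task itself moves the entry to its final state.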

Key files (exact lines of interest)

  • core/archipelago/src/api/rpc/container.rs:85-107 — handle_container_stop (blocking — target of fix)
  • core/archipelago/src/api/rpc/container.rs:61-83 — handle_container_start
  • core/archipelago/src/api/rpc/container.rs:148-154 — narrow state mapping (drops transitional → "unknown")
  • core/archipelago/src/api/rpc/package/runtime.rs:11-24 — stop_timeout_secs table (reference, unchanged)
  • core/archipelago/src/api/rpc/package/runtime.rs:122-173 — handle_package_stop (also blocking, mirror treatment)
  • core/archipelago/src/api/rpc/package/runtime.rs:28-119 — handle_package_start
  • core/archipelago/src/api/rpc/package/runtime.rs:176-242 — handle_package_restart
  • core/archipelago/src/api/rpc/package/progress.rs — existing broadcast pattern to mirror (set_install_progress, set_uninstall_stage)
  • core/archipelago/src/api/rpc/mod.rs:62-100 — RpcHandler struct (already holds Arc<dyn ContainerOrchestrator> + state_manager)
  • core/archipelago/src/server.rs:812-857 — scan_and_update_packages (merge loop at L850-857 is where transitional-state clobber happens)
  • core/archipelago/src/container/docker_packages.rs:636-663 — convert_state + package_state_str (read-only reference, no change)
  • core/archipelago/src/container/traits.rs — ContainerOrchestrator trait (stays synchronous, do not change)
  • core/archipelago/src/crash_recovery.rs — mark_user_stopped / clear_user_stopped (call order preserved)
  • core/archipelago/src/data_model.rs:107-124 — PackageState enum (no change — all variants exist)
  • neode-ui/src/api/container-client.ts — ContainerStatus type + RPC methods (extend)
  • neode-ui/src/stores/container.ts:93-312 — Pinia store (add getAppVisualState, add restartContainer action)
  • neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383 — two-button block + state reads
  • neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232 — details page Stop/Start
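The shape of the fix for the blocking stop handlers can be sketched as "record the transitional state, return, finish in the background". The example below uses std threads and a plain map so it stays self-contained; the real handlers would presumably use tokio tasks and the existing state_manager. `States`, `stop_async`, the state strings, and the choice to fall back to "running" on failure are all assumptions of this sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical shared state table: app id -> state string.
pub type States = Arc<Mutex<HashMap<String, &'static str>>>;

/// Mark the app as "stopping" right away, then run the slow, synchronous
/// stop on a worker thread. The caller (the RPC handler) can return at once.
pub fn stop_async<F>(states: &States, app_id: &str, blocking_stop: F) -> JoinHandle<()>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    states.lock().unwrap().insert(app_id.to_string(), "stopping");

    let states = Arc::clone(states);
    let app_id = app_id.to_string();
    thread::spawn(move || {
        // The orchestrator trait stays synchronous; only the caller changes.
        let final_state = match blocking_stop() {
            Ok(()) => "stopped",
            // Assumption: a failed stop leaves the container running.
            Err(_) => "running",
        };
        states.lock().unwrap().insert(app_id, final_state);
    })
}
```

The returned handle is only there so a test can wait for completion; an RPC handler would drop it and reply immediately, which is what makes the "Stopping…" label appear in under a second.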

Chaos harness (not in repo — lives on .116)

  • archipelago@192.168.1.116:~/ui-chaos/ — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
  • /tmp/chaos/ on laptop — canonical source for rsync to .116.
  • Run: cd ~/ui-chaos && npx playwright test tests/<spec>
  • Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
  • Uses SSH+Playwright hybrid per design; includes the bash -lc '<escaped>' single-quote fix for ssh argv flattening and JSON-parsed podman inspect instead of Go templates.
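The single-quote fix mentioned above exists because ssh joins its arguments into one string that the remote shell parses again. The harness itself is Playwright-based, so the following is only a Rust illustration of the quoting rule, with hypothetical function names.

```rust
/// Wrap `s` in single quotes for a POSIX shell. An embedded single quote
/// is written as: close quote, escaped quote, reopen quote ('\'').
pub fn sh_single_quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('\'');
    for ch in s.chars() {
        if ch == '\'' {
            out.push_str("'\\''");
        } else {
            out.push(ch);
        }
    }
    out.push('\'');
    out
}

/// Build the single string handed to ssh so that the remote side runs the
/// whole script inside one login shell instead of re-splitting it.
pub fn remote_login_shell_command(script: &str) -> String {
    format!("bash -lc {}", sh_single_quote(script))
}
```

For example, `remote_login_shell_command("podman inspect lnd")` yields `bash -lc 'podman inspect lnd'`, which survives ssh's argv flattening as a single argument to `-c`.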

Pre-existing bugs still deferred (do not fix until Stop UX lands)

  1. archipelago --version spawns server (should be a pure CLI query)
  2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
  3. docker_packages.rs filters out UI containers (archy-lnd-ui, archy-electrs-ui) — some views need them visible
  4. lnd.lan_address stale on .228
  5. first-boot silent failure on some hardware
  6. web-ui.failed.* scar on .228 (benign systemd unit state)
  7. test_parse_image_versions pre-existing broken assertion — fix or #[ignore] when touching that area

Where we are

Working through the 11-step plan in rust-orchestrator-migration.md.

  • Step 1 — 3767c267 ContainerConfig schema with build:, ResolvedSource enum, resolve(), 10 tests
  • Step 2 — 34af4d9d ContainerRuntime trait gained image_exists + build_image, 4 argv tests, 25/25 pass
  • Step 3 — b6a04d31 ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
  • Step 4 — e8a59c93 ContainerOrchestrator trait, RpcHandler uses it in prod (+ 13858842 chore gitignore ._*)
  • Step 5 — fc39b04b BootReconciler with Arc shutdown, 4 paused-time tests pass
  • Step 6 — 48f08aa3 main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
  • Step 7 — 069bc4a5 bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
  • Step 8a — a0707f4d retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
  • Step 9 — Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
  • .228 dashboard bugs — ExtraHost 192.168.1.254 bug (3ee192ba) + LND macaroon permission bug (be960023). See "Post-Step 9 bug hunt" below.
  • Step 8b — Port remaining ~25 container creations from first-boot-containers.sh into apps/<id>/manifest.yml, then port update.rs to orchestrator (deferred, multi-day work)
  • Step 8c — Rename first-boot-containers.sh → first-boot-setup.sh, strip container ops, keep setup. Delete reconcile-containers.sh + container-specs.sh. Add ISO lines to copy apps/ (final one-way door, requires 8b complete)
  • Step 10 — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
  • Step 11 — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)

Post-Step 9 bug hunt (.228, 2026-04-23)

User reported three visible dashboard bugs after Step 9 verification:

  1. LND — "no connect details or QR"
  2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
  3. bitcoin-core — in scope for chaos testing

Root cause #1 (ExtraHost, commit 3ee192ba): scripts/first-boot-containers.sh computed HOST_GATEWAY from ip route show default, which returns the LAN router (e.g. 192.168.1.254), not the gateway to the host. Every container configured with --add-host=host.containers.internal:$HOST_GATEWAY was dialing the WiFi router instead of the host. LND crash-looped with dial tcp 192.168.1.254:8332: connection refused; ElectrumX's DAEMON_URL hit the same dead end; any archy-net bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic host-gateway literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected --add-host; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).

Root cause #2 (macaroon permissions, commit be960023): LND's admin.macaroon lives at /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (getinfo, connect-info, export-channel-backup) plus the shared lnd_client() helper failed with "Failed to read LND admin macaroon". Confirmed pre-existing on .116 too (long-standing bug unrelated to Step 9). Fix: centralised the path as LND_ADMIN_MACAROON_PATH, added a read_lnd_admin_macaroon() helper in api/rpc/lnd/mod.rs that tries direct read first then falls back to sudo -n cat (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — curl -k https://<host>/lnd-connect-info now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
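The direct-read-then-sudo pattern described for read_lnd_admin_macaroon() can be sketched as below. This is not the helper from api/rpc/lnd/mod.rs: the function name is hypothetical, and limiting the fallback to permission errors is an assumption of the sketch (the source only says the direct read is tried first).

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Read a file directly; if the OS denies access, retry through
/// non-interactive sudo (`sudo -n cat <path>`), which fails fast instead
/// of prompting when no passwordless rule exists.
pub fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        // Assumption: only permission errors are worth retrying via sudo.
        Err(e) if e.kind() == io::ErrorKind::PermissionDenied => {
            let out = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if out.status.success() {
                Ok(out.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!("sudo -n cat failed for {}", path.display()),
                ))
            }
        }
        // Missing file and other errors are reported unchanged.
        Err(e) => Err(e),
    }
}
```

Routing every call site through one function like this means the UID 100000 versus UID 1000 mismatch is handled in a single place.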

Step 9 evidence (.228, 2026-04-23)

  • Binary: Step 9 build with 732df1b8 + ba83f9bc, scp'd to .228 as /usr/local/bin/archipelago. Old binary backed up at /usr/local/bin/archipelago.bak-pre-step9. Later replaced with macaroon-fix build (be960023); previous backed up at /usr/local/bin/archipelago.bak-pre-macaroon.
  • DEV_MODE override disabled (override.conf → override.conf.disabled-pre-step9).
  • /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml populated.
  • /opt/archipelago/docker/bitcoin-ui/Dockerfile replaced with the Step 7 version (no COPY nginx.conf). Old dir backed up as bitcoin-ui.bak-pre-step9.
  • Post-start snapshot:
    • 🔗 Adopted 1 existing container(s): ["electrs-ui"] — adoption of 13h-running container worked without recreation
    • 🔄 Boot reconciler started (interval: 30s) — every 30s, all three app_ids reach NoOp after the initial install pass
    • bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18 — pre-start hook fires in install_fresh
    • curl localhost:8334 → HTTP 200 (bitcoin-ui), :8081 → 200 (lnd-ui), :50002 → 200 (electrs-ui)
    • OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)

Bugs fixed this session

  1. parse_memory_limit truncation bug (732df1b8): lowercased "128Mi" → "128mi" → trim_end_matches('m') → "128i" → f64 parse fails → None.unwrap_or(0) → OCI memory.limit:0 → systemd rejects MemoryMax=0. 6 regression tests; create_container now omits instead of emitting 0.
  2. archipelago.service cgroup delegation missing (ba83f9bc): belt-and-braces Delegate=memory pids cpu io.
  3. ExtraHost 192.168.1.254 (3ee192ba): see Post-Step 9 bug hunt above.
  4. LND admin.macaroon unreadable (be960023): see Post-Step 9 bug hunt above.
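To make bug 1 concrete, here is a minimal sketch of suffix parsing that avoids the truncation chain ("128Mi" → "128mi" → "128i"). It is not the code from 732df1b8: it handles only Ki/Mi/Gi and plain byte counts, is case-sensitive, and uses integer arithmetic, all of which are assumptions of the sketch. `oci_memory_limit` is a hypothetical name for the "omit instead of emitting 0" rule.

```rust
/// Parse "128Mi"-style limits into bytes. Returns None for anything that
/// is not a whole number optionally followed by Ki, Mi or Gi.
pub fn parse_memory_limit(raw: &str) -> Option<u64> {
    let s = raw.trim();
    // Strip the whole two-letter suffix in one step, so a trailing "i"
    // can never be left behind on the numeric part.
    let (digits, multiplier) = if let Some(n) = s.strip_suffix("Ki") {
        (n, 1u64 << 10)
    } else if let Some(n) = s.strip_suffix("Mi") {
        (n, 1u64 << 20)
    } else if let Some(n) = s.strip_suffix("Gi") {
        (n, 1u64 << 30)
    } else {
        (s, 1)
    };
    digits.trim().parse::<u64>().ok()?.checked_mul(multiplier)
}

/// Value to place in the container spec: a zero or unparseable limit is
/// omitted (None) rather than emitted as 0.
pub fn oci_memory_limit(raw: &str) -> Option<u64> {
    parse_memory_limit(raw).filter(|&bytes| bytes > 0)
}
```

Under these assumptions "128Mi" parses to 134217728 bytes, and the mangled "128i" from the original bug is rejected instead of silently becoming 0.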

Commits made this session

3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer}  (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)

Branch is 19 commits ahead of tx1138/main (local only — user pushes to mirrors personally).

Uncommitted state

Clean. Only untracked: tests/ (bats harness from prior session, not in scope), tmp-dump-spec.py (scratch).

Answered design questions (no need to re-ask)

  1. UI container naming → archy-<app_id> for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
  2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
  3. Reconciler interval → 30 seconds
  4. Concurrency → per-app Mutex<()> in a DashMap
  5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
  6. Step 4 extension → ContainerOrchestrator trait includes install(app_id); the manifest_path-based install RPC stays dev-only
  7. Step 7 bitcoin-ui template → embed via include_str!, render on install + every reconcile, atomic tmp+rename to /var/lib/archipelago/bitcoin-ui/nginx.conf, bind-mount into container. RPC user hardcoded archipelago, password from /var/lib/archipelago/secrets/bitcoin-rpc-password.
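The "atomic tmp+rename" part of answer 7 can be sketched as follows. The placeholder-substitution helper and the `.tmp` suffix are assumptions made for illustration; the real template, its placeholder names, and the include_str! wiring are not shown here.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical renderer: substitute one placeholder in a template string.
pub fn render_template(template: &str, placeholder: &str, value: &str) -> String {
    template.replace(placeholder, value)
}

/// Write `contents` to `<target>.tmp`, flush it to disk, then rename it
/// over `target`. On the same filesystem a reader sees either the old
/// file or the new one, never a half-written config.
pub fn write_atomic(target: &Path, contents: &str) -> io::Result<()> {
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents.as_bytes())?;
        file.sync_all()?;
    } // file handle closed here, before the rename

    fs::rename(&tmp, target)
}
```

Because the rename replaces the file in one step, rendering on every reconcile pass is safe even while nginx inside the container has the bind-mounted path open.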

Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
| --- | --- | --- | --- | --- |
| archy | 192.168.1.116 | Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| archy228 | 192.168.1.228 | Kiosk HP ProDesk. Step 9 landing zone — now running Rust-orchestrator binary in prod mode. | password123 | archipelago |

Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.

Next action

Step 10 — Hot-swap on .116.

Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.

Steps:

  1. Disable DEV_MODE on .116 (check if override.conf exists — /etc/systemd/system/archipelago.service.d/)
  2. Stage the already-built binary at ~/Projects/archy/core/target/release/archipelago → /usr/local/bin/archipelago.new
  3. Ensure /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml present (copy from repo)
  4. Ensure /opt/archipelago/docker/bitcoin-ui/ matches the Step-7 layout (no baked nginx.conf)
  5. Snapshot: podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}" → save to /tmp/pre-step10-containers.txt
  6. systemctl stop archipelago → install binary → systemctl start archipelago
  7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
  8. If broken → restore .bak binary, re-enable DEV_MODE override.
  9. Commit STATUS.md update.
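The "no container was recreated" check in step 7 can be automated by comparing the step 5 snapshot with one taken after the restart. The sketch below assumes both files use the tab-separated `Names`, `Status`, `CreatedAt` layout from step 5 and treats the CreatedAt column as an opaque string; the function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Parse "name<TAB>status<TAB>created_at" lines into name -> created_at.
/// Lines with fewer than three columns are skipped.
pub fn parse_snapshot(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?.trim();
            let _status = cols.next()?; // status is expected to change
            let created = cols.next()?.trim();
            if name.is_empty() {
                return None;
            }
            Some((name.to_string(), created.to_string()))
        })
        .collect()
}

/// Names from the pre-swap snapshot that are gone afterwards or whose
/// CreatedAt differs, i.e. containers that were not adopted in place.
pub fn recreated_or_missing(pre: &str, post: &str) -> Vec<String> {
    let before = parse_snapshot(pre);
    let after = parse_snapshot(post);
    before
        .iter()
        .filter(|(name, created)| after.get(*name) != Some(*created))
        .map(|(name, _)| name.clone())
        .collect()
}
```

An empty result means every pre-existing container kept its creation timestamp, which is the adoption outcome Step 10 is looking for.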

Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.

After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).


Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

  • first-boot-containers.sh creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
  • Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
  • update.rs (OTA update RPC) invokes reconcile-containers.sh at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
  • Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.


Current state

Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

| Mirror | Host | Status |
| --- | --- | --- |
| tx1138 | https://git.tx1138.com | v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | v1.7.40-alpha live (Gitea recovered via podman system renumber — see below) |
| .168 | http://146.59.87.168:3000 | v1.7.40-alpha live |

Fleet test nodes:

| Node | Version | State |
| --- | --- | --- |
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via systemd-run chmod 755 /opt/archipelago/web-ui after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | | unreachable today |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |

Known open issues (drives the plan below)

  1. UI companion containers disappear on .228 after daemon restarts — no auto-recreate (fixed by v1.7.45 Quadlet migration)
  2. bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
  3. host.containers.internal resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
  4. Podman state DB loss requires manual recovery (fixed by v1.7.44 startup self-heal)
  5. LND "Connect Wallet" info vanishing after crashes — symptom of the same drift class as #2
  6. ElectrumX not syncing on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled

Recent field incident (2026-04-22)

  • Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was drwx------ (700). Every node that OTA'd got 500 errors on every page.
  • Root-cause fix shipped in v1.7.40 (create-release-manifest.sh chmod + pre-ship assertion that tar tvzf | head -1 shows drwxr-xr-x).
  • .160 Gitea was down all day (502) because its rootless podman's libpod/bolt_state.db had vanished. Recovered via clearing /run/user/$UID/{containers,libpod,podman} + podman system renumber.
  • Full failure-mode audit is in bulletproof-containers.md.

Plan

We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.

Release roadmap

| Release | Closes | What lands | Status |
| --- | --- | --- | --- |
| v1.7.41 | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes https://127.0.0.1/ on boot; if non-200 within 90s, restores web-ui.bak + calls rollback_update() + restarts | in flight — deploying to .228 for test |
| v1.7.42 | FM4 (host.containers.internal wrong) | /etc/containers/containers.conf w/ host_containers_internal_ip = 10.89.0.1; every container gets --add-host=host.archipelago:10.89.0.1 | pending |
| v1.7.43 | FM2 (config drift) | reconcile::derived::render_bitcoin_conf — pure fn over canonical secret, rewrites on drift. Same for lnd.conf | pending |
| v1.7.44 | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via /run/user/$UID/* clear + system renumber | pending |
| v1.7.45 | FM1 + FM3 (companion orphans) | archy-bitcoin-ui → Quadlet .container unit in /etc/containers/systemd/. systemd (not archipelago) owns it | pending |
| v1.7.46 | | archy-lnd-ui → Quadlet | pending |
| v1.7.47 | | archy-electrs-ui → Quadlet | pending |
| v1.7.48+ | all (full daemon refactor) | core/archipelago/src/reconcile/ module replaces imperative install.rs container management. Main app containers become Quadlet too | pending |

Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.


Release history

v1.7.41-alpha — IN FLIGHT — 2026-04-22

Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:

  • core/archipelago/src/update.rs: PendingVerification struct, write marker before service restart, verify_pending_update() on new binary boot — probes https://127.0.0.1/, on fail restores web-ui.bak + calls rollback_update() + systemctl restart archipelago
  • core/archipelago/src/main.rs: startup task invokes verifier concurrently with server
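The decision logic behind the post-OTA verifier is a bounded polling loop. A minimal sketch follows, with the HTTP probe abstracted into a closure so it has no network dependency. `Verdict`, `verify_within_window` and the poll interval parameter are hypothetical; the real verify_pending_update() in update.rs may be structured differently.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// The web UI answered 200 inside the window; keep the new version.
    Healthy,
    /// No 200 before the deadline; the caller should restore and roll back.
    RollBack,
}

/// Call `probe` until it reports HTTP 200 or `window` has elapsed.
/// `probe` returns the status code, or None if the request itself failed.
pub fn verify_within_window<P>(mut probe: P, window: Duration, poll_every: Duration) -> Verdict
where
    P: FnMut() -> Option<u16>,
{
    let deadline = Instant::now() + window;
    loop {
        if probe() == Some(200) {
            return Verdict::Healthy;
        }
        if Instant::now() >= deadline {
            return Verdict::RollBack;
        }
        thread::sleep(poll_every);
    }
}
```

In production the window would be the 90 seconds described above and the probe a request to https://127.0.0.1/; keeping the probe injectable makes the rollback path testable without a broken nginx.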

v1.7.40-alpha — 2026-04-22

Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.

Changes:

  • scripts/create-release-manifest.sh: pre-tar chmod + tar --mode flag + post-tar verify
  • Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
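The pre-ship assertion is implemented in the shell script; the check it performs on the first line of `tar tvzf` output can be illustrated as below. The function name and error text are hypothetical, and the listing is assumed to start with the mode column, as tar's verbose listing does.

```rust
/// Given the output of a verbose tar listing, confirm that the first
/// entry (the archive's root directory) has mode drwxr-xr-x.
pub fn assert_root_dir_mode(tar_listing: &str) -> Result<(), String> {
    let first = tar_listing
        .lines()
        .next()
        .ok_or_else(|| "empty tar listing".to_string())?;
    // The mode string is the first whitespace-separated column.
    let mode = first.split_whitespace().next().unwrap_or("");
    if mode == "drwxr-xr-x" {
        Ok(())
    } else {
        Err(format!("root entry has mode {mode:?}, expected \"drwxr-xr-x\""))
    }
}
```

A root entry of `drwx------`, the v1.7.38/39 failure, is rejected here before a release can ship.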

v1.7.39-alpha — 2026-04-22

Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.

v1.7.38-alpha — 2026-04-22

Onboarding auto-heal + silent logins + App Store trim.

Changes:

  • auth.rs: is_onboarding_complete() auto-heals from setup_complete + password_hash (prevents clear-cache → onboarding wizard bug)
  • useOnboarding: tri-state — backend-unreachable no longer defaults to /onboarding/intro
  • Login sounds gated by isFirstInstallPhase() — silent after onboarding, typing sounds unaffected
  • Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
  • Deleted 15 image versions from tx1138, .168, gitea-local registries
  • AIUI baked into release tarball via demo/aiui/
  • prebuild hook syncs app-catalog/catalog.json → public/catalog.json

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

v1.7.37-alpha — 2026-04-22

Bitcoin Core install fixes + dynamic node UI + full-archive default.

  • Bitcoin Core passes explicit -rpcbind/-rpcallowip/etc. CLI args so vanilla image exposes RPC
  • Split bitcoin-core from bitcoin-knots in backend AppMetadata
  • bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
  • Storage (Full Archive · X GB / Pruned) indicator on dashboard
  • Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
  • Pull fallback to docker.io when no mirror carries the image
  • Removed prune=550 hardcode — full archive default

Key docs


How to resume

  1. Check fleet mirrors are all live: curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version
  2. Read bulletproof-containers.md for the current plan
  3. Check task list (/list or via Claude Code) for the in-flight release
  4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified