lfg2025/archy

archipelago f9fef8d2cc docs(status): record rounds 3-5 + config migration + changelog as shipped

Adds a new top section to STATUS.md covering v1.7.43-alpha:

- Round 3: phase-based install progress bar
- Round 4: post-install scanner kick for instant Launch button
- Round 5: .23 VPS retirement, .168 promoted to Server 1
- Config migration: auto-purge .23 from saved registry/mirror JSONs
- Changelog: new v1.7.43-alpha entry in AccountInfoSection

All 5 commits, deployment md5, verification notes, and git remote
cleanup captured. Round 2 rollback command still valid for the full
stack since backups predate every round in this session.

2026-04-23 09:09:02 -04:00

RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Install UX polish: phase-based progress bar, post-install scanner kick for instant Launch button, .23 VPS retired with auto-purge migration, frontend/backend deployed to .228 as v1.7.43-alpha.)

To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.

✅ INSTALL UX POLISH + .23 RETIREMENT — SHIPPED (v1.7.43-alpha)

Rounds 3–5 + config migration + changelog (2026-04-23) — 5 commits on main (unpushed per user mirror protocol):

8cc84ebc feat(install): phase-based progress bar replaces unparseable pull bytes — podman pull emits zero parseable progress when stderr is piped (no TTY), so the legacy byte-counting regex never matched. Replaced with 7 phase-based levels: Preparing (5%) → PullingImage (20%) → CreatingContainer (70%) → StartingContainer (80%) → WaitingHealthy (88%) → PostInstall (95%) → Done (100%). UI maps phases to fixed % and only advances forward (Math.max). Final phase label renamed from "Running post-install…" to "Finalizing…" after user feedback that it read like a regression to the install step.
f86d86c3 fix(install): kick scanner post-install so Launch button appears immediately — scan runs every 60s; post-install the state flipped to Running but the skeletal install-time manifest (interfaces: None) persisted until next scan, so canLaunch(pkg) returned false for up to a minute. Added scan_kick: Arc<Notify> + scan_tick: Arc<watch::Sender<u64>> on RpcHandler. Scan loop uses tokio::select! between the 60s interval and the notify. New kick_scanner_and_wait helper (2s timeout) called in install/update success paths BEFORE writing Running, so a fresh manifest lands first. Merge during Installing/Updating uses merge_preserving_transitional (keeps state, takes fresh manifest).
22052325 chore: retire .23 VPS mirror, promote .168 OVH to primary — dropped DEFAULT_TERTIARY_MIRROR_URL, promoted .168 to DEFAULT_SECONDARY_MIRROR_URL as "Server 1 (OVH)". 2-entry default registry (.168 priority 0, tx1138 priority 10). Trusted-registry allowlist, catalog fallback, installer ISO registries, marketplaceData.ts REGISTRY, image-versions.sh all updated. Tests updated for new default counts (registry 3→2, mirror 3→2). URL-parser fixture tests in update.rs retain .23 strings intentionally — they exercise string-parsing logic, not policy.
0ee16820 fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs — load_mirrors/load_registries normally only ADD missing defaults (explicit removals stick, by design). Existing nodes have .23 baked into their saved update-mirrors.json + config/registries.json and would pay timeouts forever against a dead host. Added targeted one-time migration in both loaders: .retain(|m| !m.url.contains("23.182.128.160")) before the defaults-merge step. Narrow-scope exception to the stickiness rule, documented in-code. Triggers lazily on next load (install RPC, update RPC, Settings UI open).
008da477 docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement — 4 release-note bullets in AccountInfoSection.vue describing async-spawn, phase progress, scanner kick, and .23 retirement from the operator's perspective. Historical "Server 3 (OVH)" entries in older changelog blocks left intact — they describe what shipped at the time.
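
The phase-to-percent mapping from 8cc84ebc can be sketched as below. This is a minimal illustration in Rust for consistency with the backend; per the commit note the real mapping lives in the UI and uses Math.max. The phase names and percentages come from the commit summary, while `advance` is a hypothetical helper name.

```rust
/// Install phases in reporting order (names taken from commit 8cc84ebc).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallPhase {
    Preparing,
    PullingImage,
    CreatingContainer,
    StartingContainer,
    WaitingHealthy,
    PostInstall,
    Done,
}

impl InstallPhase {
    /// Fixed percentage displayed for each phase.
    pub fn percent(self) -> u8 {
        match self {
            InstallPhase::Preparing => 5,
            InstallPhase::PullingImage => 20,
            InstallPhase::CreatingContainer => 70,
            InstallPhase::StartingContainer => 80,
            InstallPhase::WaitingHealthy => 88,
            InstallPhase::PostInstall => 95,
            InstallPhase::Done => 100,
        }
    }
}

/// Forward-only update: a late or repeated phase report never moves the bar back.
/// `advance` is a hypothetical name, not taken from the repo.
pub fn advance(current_percent: u8, reported: InstallPhase) -> u8 {
    current_percent.max(reported.percent())
}
```

With this shape, a stale PullingImage report arriving after CreatingContainer leaves the bar at 70 rather than dropping it to 20.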

Deployed to .228:

Backend binary md5 d2b619949f19815faaeab10429e36ba0 at /usr/local/bin/archipelago.
Frontend at /opt/archipelago/web-ui/ (includes marketplaceData.ts .168 update + v1.7.43-alpha changelog entry). Deployed bundle verified: .168 present in Settings-*.js + Marketplace-*.js, .23 absent from all assets.
/var/lib/archipelago/update-mirrors.json + config/registries.json were manually deleted + regenerated with new defaults during Round 5 verification; migration code will handle any other node on first load.
Rollback targets from Round 2 still valid: /usr/local/bin/archipelago.bak-pre-async-install + /opt/archipelago/web-ui.bak-pre-async-install/.
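
For reference, the ordering used by the 0ee16820 migration (purge the retired host first, then merge missing defaults) can be sketched as follows. Only the `retain` predicate is taken from the commit note. The `Mirror` struct, the `migrate_mirrors` name, and the merge-by-URL rule are assumptions made for illustration, and the sketch does not model how explicit user removals are remembered.

```rust
/// Hypothetical, trimmed-down mirror entry; the real struct likely has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Mirror {
    pub url: String,
    pub priority: u32,
}

/// IP of the retired .23 VPS, as quoted in commit 0ee16820.
const DECOMMISSIONED_HOST: &str = "23.182.128.160";

/// Drop every saved entry pointing at the retired host, then append any
/// default whose URL is not already present (assumed merge rule).
pub fn migrate_mirrors(mut saved: Vec<Mirror>, defaults: &[Mirror]) -> Vec<Mirror> {
    // Step 1: the narrow, one-time purge.
    saved.retain(|m| !m.url.contains(DECOMMISSIONED_HOST));
    // Step 2: the usual "add missing defaults" pass.
    for d in defaults {
        if !saved.iter().any(|m| m.url == d.url) {
            saved.push(d.clone());
        }
    }
    saved
}
```

Running the purge before the defaults merge means a node that only had the dead host saved still ends up with a usable list after one load.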

Git remotes cleaned on .116 (working-copy change only, not in any commit):

git remote remove gitea-vps (dropped the .23 Gitea remote).
git remote set-url --delete --push origin http://.../23.182.128.160:3000/... (dropped .23 from origin multi-push alias).
Remaining push targets: tx1138 (canonical), gitea-local (localhost Gitea), gitea-vps2 (.168 OVH).

Rollback Rounds 3–5 (same command as Round 2 — backups predate all of this):

ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'

✅ ASYNC-SPAWN LIFECYCLE FIX — SHIPPED (Stop/Start/Restart + Install/Uninstall/Update)

Round 2 (2026-04-23, install/uninstall/update) — 3 commits on main:

2d5b859e feat(rpc): async-spawn install/uninstall/update lifecycle — new api/rpc/package/async_lifecycle.rs with spawn_package_install, spawn_package_uninstall, spawn_package_update. Dispatcher + handler thread self: Arc<Self> so spawned tasks own their Arc. Install/update Ok arms explicitly set Running because merge_preserving_transitional refuses to let the scanner overwrite Installing/Updating. Removed redundant inner "already updating" guard in update.rs. Transient install entry uses empty icon (see commit 3 rationale).
0733ac40 fix(ui): shorten install/uninstall/update timeouts for async RPCs — drop 11m/45m timeouts to 15s across rpc-client.ts, stores/server.ts, and the 5 direct call sites in Marketplace.vue, Discover.vue, MarketplaceAppDetails.vue. Return types updated to { status, package_id }.
e471ef75 fix(rpc): empty icon in transient install entry to avoid broken-image flicker — progress.rs::create_installing_entry no longer hardcodes /assets/img/app-icons/<id>.png. About half of bundled apps use .svg/.webp icons; the frontend's fallback chain (backend_icon || curated.icon || placeholder) now lands on the correct curated extension.
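
The icon fallback chain from e471ef75 (backend_icon || curated.icon || placeholder) behaves roughly like the sketch below. The real chain is frontend code; this Rust version is only an illustration, and `resolve_icon` plus the example paths are hypothetical.

```rust
/// Pick the first non-empty icon: backend value, then curated value, then placeholder.
/// An empty backend icon (the transient install entry) falls through to the curated one.
pub fn resolve_icon<'a>(
    backend_icon: &'a str,
    curated_icon: Option<&'a str>,
    placeholder: &'a str,
) -> &'a str {
    if !backend_icon.is_empty() {
        return backend_icon;
    }
    match curated_icon {
        Some(icon) if !icon.is_empty() => icon,
        _ => placeholder,
    }
}
```

Because the transient entry now carries an empty icon, an app with a curated .svg icon resolves to that .svg instead of a hardcoded .png path that may not exist.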

Deployed to .228 (binary md5 f66857b3b8b3640c8cac8bd25fe508ec at /usr/local/bin/archipelago, backup at /usr/local/bin/archipelago.bak-pre-async-install; frontend at /opt/archipelago/web-ui/, backup at /opt/archipelago/web-ui.bak-pre-async-install/). User confirmed: uninstall fast and responsive, install of LND + SearXNG clean, icon flicker fixed.

Known out-of-scope issue: Vaultwarden container itself exits immediately on start with an internal error. The async wrapper correctly detects this via post-start exit verification and removes the state entry. Needs separate vaultwarden container-config investigation.

Rollback Round 2 (if ever needed):

ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'

Round 1 (Stop/Start/Restart) — 4 commits on main (unpushed per user mirror protocol):

44cd5eef feat(rpc): spawn_transitional helper for async lifecycle ops — new api/rpc/transitional.rs with Op::{Stop,Start,Restart} and RpcHandler::spawn_transitional / flip_to_transitional / set_state helpers. install_log re-exported so sibling modules can use it.
19a99ca9 fix(rpc): async container stop/start/restart; widen state mapping — container.rs start/stop rewritten + restart added; container-list now emits all transitional variants instead of falling back to "unknown". dispatcher.rs registers container-restart. package/runtime.rs mirrored with do_package_* helpers inside tokio::spawn and revert-on-error.
6712810b fix(state): preserve transitional state across container scans — server.rs scan merge now keeps transitional states while taking fresh observability fields; 1200s stuck-timeout escape hatch via transitional_since: HashMap<String, Instant>. Three passing server::merge_tests.
9ce28f08 fix(ui): single-button lifecycle control with transitional labels — ContainerApps.vue and ContainerAppDetails.vue use a single primary button driven by getAppVisualState(). Dashboard now routes through container-start/container-stop (the async RPCs) instead of the legacy synchronous bundled-app-* path. ContainerStatus.vue widened to render all new variants.

Deployed to .228 (ThinkPad demo device):

Binary at /usr/local/bin/archipelago (md5 de86b63f74c7e6fe6e555ffe30b86b4f), backup at /usr/local/bin/archipelago.bak-pre-async-stop.
Frontend at /opt/archipelago/web-ui/, backup at /opt/archipelago/web-ui.bak-pre-async-stop/.
Release build took 3m56s on .116. Deploy via scp + atomic install -m 755 + systemctl restart archipelago. nginx -t + systemctl reload nginx for frontend.

Manual verification: user clicked Stop on LND in the dashboard. Button flipped to Stopping… instantly, held for the full graceful-stop window, transitioned to Start when podman stop completed. No mid-flight revert to Running. User sign-off: "absolutely beautiful".

Rollback (if ever needed):

ssh archy228 'sudo cp /usr/local/bin/archipelago.bak-pre-async-stop /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-stop/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'

Follow-ups to consider

Chaos matrix / Step 11 — the original next-step gated behind this fix. Now unblocked.
bundled-app-start / bundled-app-stop — still synchronous in the backend. Dashboard no longer calls them, but the RPC methods remain for any external caller. Decide: deprecate, or mirror the async-spawn treatment for parity.
transitional_since persistence — currently in-memory only, so a backend restart mid-stop loses the timeout anchor. Acceptable for now (scan loop re-observes live podman state and reconciles), but worth revisiting if crash-recovery stories tighten.
Test regressions inventory — the full cargo test -p archipelago run on .116 shows 22 pre-existing failures in unrelated modules (mesh/wallet/credentials/avatar/session/transport/update-mirrors/fips/identity_manager/image_versions). Unrelated to this work but tech debt. Log at /tmp/cargo-test-all.log on .116.
Amend STATUS.md's older "NEXT SESSION — START HERE" section (below) — it is now stale. Left in place for historical reference of how the fix was designed; delete on the next pass if it gets confusing.

⚡ NEXT SESSION — START HERE (historical — fix above is now shipped)

Goal: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: "best server containers in the world". Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live Stopping… label.

How to work on this repo (SSH + SSHFS setup)

You are likely running on the laptop (macOS). The repo lives on the ThinkPad (.116). There are two access paths, use both in parallel:

SSHFS mount at ~/mnt/archy-thinkpad/ — for all file ops (read/edit/write/glob/grep).
Direct SSH — for everything that isn't file ops: git, cargo, npm, systemctl, running the server, tailing logs.

See the "FUSE / SSHFS development loop" section below for the full mount lifecycle — that's the thing that makes this dev setup work, and it will break periodically.

FUSE / SSHFS development loop

Why this exists: editing the repo directly on the ThinkPad over raw SSH means no IDE, no tool-native file reads, no glob/grep speed. SSHFS mounts the remote filesystem as a local directory so OpenCode's file tools work transparently. But SSHFS is a leaky abstraction — know the gotchas or you'll waste hours.

Stack (macOS laptop):

macFUSE — kernel extension providing FUSE on macOS. Install via brew install --cask macfuse (requires reboot + security approval in System Settings the first time).
sshfs — userspace mount tool. Install via brew install gromgit/fuse/sshfs-mac (the homebrew core sshfs was removed; use this tap).
Verify: which sshfs → /opt/homebrew/bin/sshfs, sshfs --version → SSHFS version 2.10 / FUSE library version 2.9.9.

Actual mount command currently running (verified from ps):

sshfs archy:Projects/archy /Users/dorian/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad

Breakdown:

archy:Projects/archy — remote path via the archy SSH alias (uses ~/.ssh/archy_opencode, no password prompt).
~/mnt/archy-thinkpad — local mount point. Create once: mkdir -p ~/mnt/archy-thinkpad.
reconnect — sshfs auto-reconnects if the TCP session drops (WiFi flap, laptop sleep). Without this, the mount turns into a zombie immediately.
ServerAliveInterval=15 — sends a keepalive every 15s.
ServerAliveCountMax=3 — disconnect after 3 missed keepalives (45s). Tune up if your network is flaky.
volname=archy-thinkpad — Finder display name.

Check mount health:

mount | grep archy-thinkpad
# should print: archy:Projects/archy on /Users/dorian/mnt/archy-thinkpad (macfuse, nodev, nosuid, synchronous, mounted by dorian)

ls ~/mnt/archy-thinkpad/ | head
# should list repo contents fast (<1s). If it hangs, mount is stale.

Recovery when the mount hangs / goes stale (this WILL happen — laptop sleeps, WiFi drops, ThinkPad reboots):

# 1. Force-unmount (macOS — `umount` alone often fails on a hung FUSE mount)
sudo diskutil unmount force ~/mnt/archy-thinkpad
# fallback if diskutil can't see it:
sudo umount -f ~/mnt/archy-thinkpad

# 2. Kill any zombie sshfs process
pkill -f "sshfs archy:Projects/archy"

# 3. Remount
sshfs archy:Projects/archy ~/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad

# 4. Verify
ls ~/mnt/archy-thinkpad/ | head

If the mount point itself got wedged (ls: /Users/dorian/mnt/archy-thinkpad: Device not configured), the sequence above still works — macFUSE garbage-collects the inode after the force-unmount.

When to use which path (rules, not suggestions):

Operation	Use	Why
`read` / `edit` / `write`	SSHFS mount	OpenCode tools want local paths
`glob` / `grep`	SSHFS mount	Local FS traversal is fine; remote would need rg over SSH
Reading many files	SSHFS mount	Each read is a round-trip but parallelizable
`git status` / `git diff` / `git log`	SSH	Git over FUSE is painfully slow (lots of stat calls)
`git add` / `git commit`	SSH	Same — commit times grow linearly with tree size on FUSE
`cargo check` / `cargo test` / `cargo build`	SSH	Compiling over FUSE would take hours; cargo's incremental stat pattern destroys FUSE performance
`npm install` / `npm run build`	SSH	Same reason — massive file churn
Running the server / tailing journal	SSH	Service lives on .116
Deploying to .228	SSH from .116	SCP from ThinkPad; laptop isn't in the critical path

Don't do this (will bite you):

cargo build from the mount — will try to write target/ over FUSE, gets orders of magnitude slower, may hang.
rsync without --exclude="._*" — macOS writes AppleDouble metadata files, they leak to the remote as ._* siblings of every real file. .gitignore already excludes them (commit 13858842), but they clutter the tree.
Writing big binary files via the mount — use scp over SSH instead.
Relying on file-change-watcher tools (watchman, chokidar) — they get confused by FUSE event semantics.

Editing workflow in a typical session:

Laptop: OpenCode reads a file via /Users/dorian/mnt/archy-thinkpad/.... FUSE fetches it over SSH, caches briefly.
Laptop: OpenCode edits the file — FUSE writes the new bytes back to .116 immediately (synchronous mount).
Laptop: ssh archy "cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago" — runs on the real filesystem on .116, sees the edit.
Laptop: ssh archy "cd ~/Projects/archy && git diff path/to/file" — confirms the edit landed.
Laptop: ssh archy "cd ~/Projects/archy && git add path/to/file && git commit -m '...'" — commit from .116.

The SSHFS mount and the SSH shell are pointing at the same inodes — edits via the mount are instantly visible to cargo/git over SSH. There's no "sync" step.

Cache caveat: macFUSE caches attributes briefly (default ~1s). If you write via SSH and read via the mount within that window, you may see stale metadata. The mount's synchronous flag (visible in mount output) minimizes but doesn't eliminate this. If SSH and the mount disagree about a file, re-read after a second, or run stat ~/mnt/archy-thinkpad/<file> to trigger a fresh attribute lookup (macOS ships BSD stat, which does not accept GNU's --file-system flag).

Direct SSH access (use when FUSE isn't the right tool):

ssh archy → archipelago@192.168.1.116 using ~/.ssh/archy_opencode
ssh archy228 → archipelago@192.168.1.228 using ~/.ssh/archy_opencode
Full host form also works: ssh archipelago@192.168.1.116 / ssh archipelago@192.168.1.228 (same key resolves via IdentitiesOnly).

SSH keys — what's where

Laptop ~/.ssh/ (macOS, user dorian):

File	Purpose
`archy_opencode` / `.pub`	Primary key for this project. Unlocks both `archy` (.116) and `archy228` (.228). Created 2026-04-22 specifically for OpenCode work.
`archipelago-deploy` / `.pub`	Older archipelago deploy key. Not needed for current work.
`id_ed25519` / `.pub`	Personal default key. Not used by archy/archy228 configs (`IdentitiesOnly yes` forces `archy_opencode`).
`id_ed25519_angor` / `.pub`	Angor project. Unrelated.
`id_ed25519_start9` / `.pub`	Start9 project. Unrelated.
`vps-ci-setup` / `.pub`	VPS CI. Unrelated.
`config`	Host aliases (shown above)

.116 /home/archipelago/.ssh/:

File	Purpose
`authorized_keys`	Accepts: laptop's `archy_opencode.pub` + 3 other keys (4 lines total).
`id_ed25519` / `.pub`	.116's OWN identity key. This is what lets `.116 → .228` work passwordless.
`archipelago-deploy`	Symlink → `id_ed25519` (legacy alias).
`id_ed25519_vps168` / `.pub`	For SSH to `146.59.87.168` (VPS). Unrelated to this work.
`config`	Host entry for the VPS only.

.228 /home/archipelago/.ssh/:

File	Purpose
`authorized_keys`	Accepts: laptop's `archy_opencode.pub` + .116's `id_ed25519.pub` + 2 others (4 lines total).
(no `id_ed25519`)	.228 has no outbound key — it's a terminal node. Don't try to `ssh` from .228 to anywhere.

Connectivity matrix (all verified 2026-04-23):

From → To	Works passwordless	Via
Laptop → .116	✅	`archy_opencode`
Laptop → .228	✅	`archy_opencode`
.116 → .228	✅	.116's `id_ed25519`
.228 → anywhere	❌	no outbound key (by design)

Sudo — verified state

.116 (dev ThinkPad):

User archipelago is in sudo group.
Sudo password required: ThisIsWeb54321@
Sudoers drop-ins present: /etc/sudoers.d/archipelago-ci, /etc/sudoers.d/archipelago-wg (scope-limited NOPASSWD for specific CI/wg commands — not full NOPASSWD).
For most dev work you don't need sudo on .116.

.228 (prod kiosk):

User archipelago has full passwordless sudo via /etc/sudoers.d/archipelago containing archipelago ALL=(ALL) NOPASSWD:ALL.
User is also in sudo group.
Sudo password (if ever prompted, shouldn't be): archipelago
Dashboard password: password123

Cargo / npm / paths

Cargo PATH gotcha: non-interactive SSH login has no cargo in PATH. Always use ~/.cargo/bin/cargo over SSH.
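- Example: ssh archy 'cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago'
- ssh has no --workdir option; anything after the quoted command is appended to the remote command line, so set the directory with cd inside the quotes (the Cargo workspace itself is at ~/Projects/archy/core).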

Long cargo builds (>2 min Bash tool timeout): launch detached and poll the log:

ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
ssh archy 'tail -30 /tmp/cargo-build.log'
ssh archy 'pgrep -a cargo'   # to check if still running

npm / frontend lives at ~/Projects/archy/neode-ui/ on .116 (also accessible via laptop mount at ~/mnt/archy-thinkpad/neode-ui/). Node is on the interactive PATH only; for scripted SSH, run source ~/.nvm/nvm.sh && nvm use first (if Node is managed by nvm), or call node/npm by absolute path.
Repo on .116: ~/Projects/archy/ (Cargo workspace at core/Cargo.toml).
Web root on .228: check /etc/nginx/sites-enabled/ for the live path; historically /var/lib/archipelago/web-ui/ or /opt/archipelago/web-ui/.

Deploying new server binary to .228

# 1. Build on .116 (detached — takes ~3-5 min for release)
ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
# wait / tail log until "Finished `release` profile"

# 2. SCP .116 → .228 (uses .116's id_ed25519 → .228's authorized_keys, passwordless)
ssh archy 'scp ~/Projects/archy/core/target/release/archipelago archipelago@192.168.1.228:/tmp/archipelago.new'

# 3. Atomic swap on .228 with backup
ssh archy228 'sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-pre-async-stop && sudo mv /tmp/archipelago.new /usr/local/bin/archipelago && sudo chmod +x /usr/local/bin/archipelago && sudo systemctl restart archipelago'

# 4. Verify
ssh archy228 'systemctl status archipelago --no-pager | head -20 && sudo journalctl -u archipelago -n 50 --no-pager'

Git workflow

Branch: main on .116, currently 22 commits ahead of tx1138/main.
Remote tx1138 exists but do NOT push — user mirrors to 4 Gitea remotes personally after reviewing.
Atomic commits, one logical change per commit. Conventional Commits format (feat:, fix:, docs:, refactor:, chore:, test:, perf:).
Never --amend unless the commit you're amending was created in this session AND has not been pushed. Safer: new commit.
Never --force push. Never modify git config.
If pre-commit hooks fail, create a NEW commit with the fix — don't --amend after a failed commit.

Other

Full destructive latitude on both nodes. Announce multi-hour ops (OTA, full rebuild, apt upgrade). Don't ask for routine stop/start/rebuild permission.
No ship pressure. Do it properly.
Use question tool for ambiguous decisions (don't guess user intent on design choices).
Keep docs/STATUS.md fresh between sessions — it IS the session handoff.

Hosts reference (quick)

Host	IP	SSH alias	Role	Dashboard	Sudo
`archy` (ThinkPad X250)	192.168.1.116	`ssh archy`	dev host, Debian 13	`archipelago`	`ThisIsWeb54321@`
`archy228` (HP ProDesk)	192.168.1.228	`ssh archy228`	prod kiosk, Rust orchestrator	`password123`	NOPASSWD (fallback `archipelago`)

Bug being fixed

Dashboard sequence when user clicks Stop LND:

UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via loadingApps.add('lnd').
Frontend calls container-stop RPC. Server runs podman stop -t 330 lnd synchronously inside the RPC handler (via orchestrator.stop()). RPC blocks up to 5.5 min for LND (330s timeout + overhead).
Meanwhile the 30-second package-scan loop in server.rs:scan_and_update_packages keeps running. It rebuilds PackageDataEntry from podman inspect — podman still reports running (stop hasn't completed) — and blindly overwrites the store entry at server.rs:854.
container-list RPC reads state_manager snapshot → returns state = "running".
Frontend polling sees running → getAppState() returns 'running' → the two-button (Start | Stop) block re-renders → the transitional button disappears → UI looks like the stop silently failed.
Eventually podman stop finishes → next scan → state flips to Stopped → buttons change again.

Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".

Decisions already locked in (do not re-ask)

Full scope fix (not minimal hotfix). User chose "Go full scope, do it right".
Async-spawn lives in the RPC layer, not in the ContainerOrchestrator trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
PackageState already has Stopping/Starting/Restarting/Installing/Updating/Removing variants — enum at core/archipelago/src/data_model.rs:107-124. No schema change needed.
UI collapses to one full-width button with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when not-installed).
Helper API shape: RpcHandler::spawn_transitional(op: Op, app_id: String) where Op is an enum {Stop, Start, Restart}. Helper dispatches to orchestrator.stop/start/restart internally, knows each op's transitional+final states, handles error → revert + install_log().
mark_user_stopped must run BEFORE the spawn (preserves ordering the crash recovery layer depends on — see runtime.rs:145-148).

Implementation order (4 commits, local only)

Commit 1 — feat(rpc): spawn_transitional helper for async lifecycle ops

New file: core/archipelago/src/api/rpc/transitional.rs (or extend container.rs; prefer new file for cohesion with future stacks/package variants)
enum Op { Stop, Start, Restart } with transitional_state(), final_state_on_success(), log_prefix(), and async dispatch(&orch, &app_id) method
impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }
- Capture Arc<dyn ContainerOrchestrator> + Arc<StateManager> clones
- Set transitional state via state_manager.update_data() (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
- tokio::spawn(async move { ... })
- Inside spawn: install_log("{LOG_PREFIX}: {app_id}"), op.dispatch(&orch, &app_id).await, on success set final state, on error log + install_log("{LOG_PREFIX} FAIL: …") + revert state to previous (cache pre-transition state in a local)
- Return Ok(()) immediately after spawn
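
A minimal synchronous sketch of the Commit 1 helper logic follows. It stands in for the body of the spawned task only: there is no tokio, orchestrator, or StateManager here, and the closure plays the role of op.dispatch(). `run_transition`, the log prefix strings, and the final states (Stop to Stopped, Start and Restart to Running) are assumptions, not confirmed by the repo.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Starting,
    Stopping,
    Restarting,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Stop,
    Start,
    Restart,
}

impl Op {
    pub fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }

    /// Assumed final states on success.
    pub fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }

    /// Hypothetical log prefixes.
    pub fn log_prefix(self) -> &'static str {
        match self {
            Op::Stop => "STOP",
            Op::Start => "START",
            Op::Restart => "RESTART",
        }
    }
}

/// Flip to the transitional state, run the operation, then either set the
/// final state or revert to the cached pre-transition state on error.
pub fn run_transition<F>(state: &mut PackageState, op: Op, dispatch: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    let previous = *state; // cached so a failure can revert
    *state = op.transitional_state();
    match dispatch() {
        Ok(()) => {
            *state = op.final_state_on_success();
            Ok(())
        }
        Err(e) => {
            *state = previous;
            Err(format!("{} FAIL: {}", op.log_prefix(), e))
        }
    }
}
```

In the real helper the transitional write happens before tokio::spawn and everything after it runs inside the spawned future, so the RPC can return immediately.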

Commit 2 — fix(rpc): async container stop/start/restart; widen state mapping

api/rpc/container.rs:85-107 — rewrite handle_container_stop body: validate_app_id, mark_user_stopped, spawn_transitional(Op::Stop, app_id.to_string()).await?, return Ok(json!({ "status": "stopping" }))
api/rpc/container.rs:61-83 — rewrite handle_container_start: clear_user_stopped, spawn_transitional(Op::Start, …), return { "status": "starting" }
Add handle_container_restart (currently missing in container.rs — only exists as package.restart at runtime.rs:176-242). Register RPC route name container-restart. Add matching frontend client method in container-client.ts.
api/rpc/container.rs:148-154 — widen the container-list state mapping: add arms for Stopping → "stopping", Starting → "starting", Restarting → "restarting", Installing → "installing", Updating → "updating", Removing → "removing", Installed → "installed", CreatingBackup/RestoringBackup/BackingUp → their kebab-case strings. No more "unknown" fallback unless the variant is genuinely unknown.
Mirror same spawn treatment in api/rpc/package/runtime.rs: handle_package_start (L28-119), handle_package_stop (L122-173), handle_package_restart (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) inside the spawned future, not in the RPC body.
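
The widened container-list mapping from Commit 2 could look like the sketch below. The enum here is a local subset written for illustration (the real PackageState lives in data_model.rs and may contain more variants), and `state_str` is a hypothetical name.

```rust
/// Illustrative subset of PackageState; not the real definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Installed,
    Stopping,
    Starting,
    Restarting,
    Installing,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Kebab-case wire string for every variant. The match is exhaustive, so a
/// newly added variant fails to compile instead of silently becoming "unknown".
pub fn state_str(state: PackageState) -> &'static str {
    match state {
        PackageState::Running => "running",
        PackageState::Stopped => "stopped",
        PackageState::Installed => "installed",
        PackageState::Stopping => "stopping",
        PackageState::Starting => "starting",
        PackageState::Restarting => "restarting",
        PackageState::Installing => "installing",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```

Avoiding a wildcard arm is a deliberate choice in this sketch: it turns the "no more unknown fallback" goal into something the compiler enforces.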

Commit 3 — fix(state): preserve transitional state across container scans

server.rs:847-857 — in the merge loop, before the merged.insert(id.clone(), pkg.clone()) overwrite, check merged.get(id).state and skip overwrite if it's transitional: matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)
Still allow non-state fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep existing.state but merge updated fields from pkg. Write a tiny helper merge_preserving_transitional(existing, fresh) -> PackageDataEntry.
Unit test: construct existing.state = Stopping, fresh.state = Running, assert merged.state stays Stopping.
Also check: Is there a timeout escape hatch? If Stopping is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck Stopping forever. Mitigation: track a transitional_since: Instant in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
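
The merge rule and the escape hatch from Commit 3 can be sketched together. The entry struct is a trimmed, hypothetical version with two observability fields, `is_stuck` is an illustrative name, and the 1200s figure in the usage note is the value recorded for the shipped fix.

```rust
use std::time::{Duration, Instant};

/// Illustrative subset of PackageState; not the real definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Installing,
    Stopping,
    Starting,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Hypothetical, trimmed-down entry.
#[derive(Debug, Clone, PartialEq)]
pub struct PackageDataEntry {
    pub state: PackageState,
    pub health: Option<String>,
    pub lan_address: Option<String>,
}

pub fn is_transitional(state: PackageState) -> bool {
    use PackageState::*;
    matches!(
        state,
        Installing | Stopping | Starting | Restarting | Updating | Removing
            | CreatingBackup | RestoringBackup | BackingUp
    )
}

/// Take every fresh field from the scan, but keep the existing state while
/// a transitional operation is in flight.
pub fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: &PackageDataEntry,
) -> PackageDataEntry {
    let mut merged = fresh.clone();
    if is_transitional(existing.state) {
        merged.state = existing.state;
    }
    merged
}

/// Escape hatch: once a transitional state has been held longer than
/// `stuck_timeout`, the scan result is allowed to override it.
pub fn is_stuck(since: Instant, now: Instant, stuck_timeout: Duration) -> bool {
    now.saturating_duration_since(since) > stuck_timeout
}
```

A caller would consult `is_stuck` with a 1200s timeout before choosing between `merge_preserving_transitional` and a plain overwrite, using the in-memory transitional_since table as the source of `since`.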

Commit 4 — fix(ui): single-button lifecycle control with transitional labels

neode-ui/src/api/container-client.ts — extend ContainerStatus.state union to: 'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'. Add restartContainer(appId) method calling container-restart.
neode-ui/src/stores/container.ts — add computed getAppVisualState(appId) that returns one of: 'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'. Maps exited→stopped, created→stopped, paused→stopped, installed→stopped. Add restartContainer(appId) action (sets loadingApps for request dedup, calls client, does NOT fetchContainers immediately because server will broadcast state; a final fetchContainers after a short delay can backstop if WebSocket push is absent).

neode-ui/src/views/ContainerApps.vue:85-136 — replace the two-button conditional with a single full-width button bound to getAppVisualState(app.id). Table:

visual state	click action	label	spinner	disabled
`not-installed`	installApp	Install	no	no
`running`	stopContainer	Stop	no	no
`stopped`	startContainer	Start	no	no
`starting`	—	Starting…	yes	yes
`stopping`	—	Stopping…	yes	yes
`restarting`	—	Restarting…	yes	yes
`installing`	—	Installing…	yes	yes
`updating`	—	Updating…	yes	yes
`removing`	—	Removing…	yes	yes

Add a separate Restart button next to the primary one when state is running, calling new restartContainer action. Restart button hides while transitional.
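
The button table above reduces to a single lookup. The sketch below expresses it in Rust purely as an illustration (the real code is Vue/TypeScript); `ButtonSpec`, `primary_button`, and `show_restart` are hypothetical names. Because the spinner and disabled columns are always equal in the table, the sketch folds them into one `busy` flag.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VisualState {
    NotInstalled,
    Running,
    Stopped,
    Starting,
    Stopping,
    Restarting,
    Installing,
    Updating,
    Removing,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ButtonSpec {
    pub label: &'static str,
    /// Store action to call on click; None while transitional.
    pub action: Option<&'static str>,
    /// True means spinner shown and button disabled.
    pub busy: bool,
}

/// One row of the table per visual state.
pub fn primary_button(state: VisualState) -> ButtonSpec {
    use VisualState::*;
    let (label, action) = match state {
        NotInstalled => ("Install", Some("installApp")),
        Running => ("Stop", Some("stopContainer")),
        Stopped => ("Start", Some("startContainer")),
        Starting => ("Starting…", None),
        Stopping => ("Stopping…", None),
        Restarting => ("Restarting…", None),
        Installing => ("Installing…", None),
        Updating => ("Updating…", None),
        Removing => ("Removing…", None),
    };
    ButtonSpec { label, action, busy: action.is_none() }
}

/// The separate Restart button is only offered while the app is running.
pub fn show_restart(state: VisualState) -> bool {
    state == VisualState::Running
}
```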

neode-ui/src/views/ContainerAppDetails.vue:83 (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
Also audit line 239 of ContainerApps.vue (some((app) => store.getAppState(app.id) === 'created')) and the logic around lines 276, 295, 309, 312 — make sure they use getAppVisualState where appropriate.

Verification gates (do not skip)

~/.cargo/bin/cargo check -p archipelago on .116 via SSH
~/.cargo/bin/cargo test -p archipelago on .116 via SSH — at least the new merge helper test must pass
Build release binary on .116: nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown. Poll until done.
SCP binary to .228 /usr/local/bin/archipelago, back up prior to /usr/local/bin/archipelago.bak-pre-async-stop. sudo systemctl restart archipelago on .228.
Manual LND stop test on .228:
- Open dashboard, confirm LND is Running (first: ssh archipelago@192.168.1.228 'podman start lnd' — LND is currently Exited(0) from the demo)
- Click Stop
- Expected: button immediately becomes "Stopping…" with spinner (RPC returns <1s)
- Dashboard should stay on "Stopping…" for ~5 min
- Then the button flips to "Start"
- At no point should it revert to "Running" mid-stop
Same test with Bitcoin Core stop (longest timeout, 600s)
Frontend build: cd ~/Projects/archy/neode-ui && npm run type-check && npm run build. Rsync dist/ to archipelago@192.168.1.228:/var/lib/archipelago/web-ui/ (or wherever the active web root is — check /etc/nginx on .228 first).
Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.

Key files (exact lines of interest)

core/archipelago/src/api/rpc/container.rs:85-107 — handle_container_stop (blocking — target of fix)
core/archipelago/src/api/rpc/container.rs:61-83 — handle_container_start
core/archipelago/src/api/rpc/container.rs:148-154 — narrow state mapping (drops transitional → "unknown")
core/archipelago/src/api/rpc/package/runtime.rs:11-24 — stop_timeout_secs table (reference, unchanged)
core/archipelago/src/api/rpc/package/runtime.rs:122-173 — handle_package_stop (also blocking, mirror treatment)
core/archipelago/src/api/rpc/package/runtime.rs:28-119 — handle_package_start
core/archipelago/src/api/rpc/package/runtime.rs:176-242 — handle_package_restart
core/archipelago/src/api/rpc/package/progress.rs — existing broadcast pattern to mirror (set_install_progress, set_uninstall_stage)
core/archipelago/src/api/rpc/mod.rs:62-100 — RpcHandler struct (already holds Arc<dyn ContainerOrchestrator> + state_manager)
core/archipelago/src/server.rs:812-857 — scan_and_update_packages (merge loop at L850-857 is where transitional-state clobber happens)
core/archipelago/src/container/docker_packages.rs:636-663 — convert_state + package_state_str (read-only reference, no change)
core/archipelago/src/container/traits.rs — ContainerOrchestrator trait (stays synchronous, do not change)
core/archipelago/src/crash_recovery.rs — mark_user_stopped / clear_user_stopped (call order preserved)
core/archipelago/src/data_model.rs:107-124 — PackageState enum (no change — all variants exist)
neode-ui/src/api/container-client.ts — ContainerStatus type + RPC methods (extend)
neode-ui/src/stores/container.ts:93-312 — Pinia store (add getAppVisualState, add restartContainer action)
neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383 — two-button block + state reads
neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232 — details page Stop/Start
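
The transitional-state clobber in the scan merge loop can be illustrated with a minimal sketch. The types and the function name `merge_scan_result` below are hypothetical, trimmed-down stand-ins and are not the real definitions in data_model.rs or server.rs; the only behaviour assumed is that a scan must not overwrite a state that is still in flight, while fresh manifest data is still accepted.

```rust
/// Hypothetical, trimmed-down stand-in for the real PackageState enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PackageState {
    Installing,
    Updating,
    Removing,
    Stopping,
    Starting,
    Restarting,
    Running,
    Stopped,
}

impl PackageState {
    /// Everything except the two settled states counts as transitional here.
    fn is_transitional(self) -> bool {
        !matches!(self, PackageState::Running | PackageState::Stopped)
    }
}

/// Hypothetical package record: a state plus some manifest-derived data.
#[derive(Clone, Debug, PartialEq)]
struct PackageEntry {
    state: PackageState,
    manifest_version: String,
}

/// Merge a freshly scanned entry into an existing one.
/// While the existing entry is transitional, keep its state and take
/// only the scanned manifest data; otherwise accept the scan wholesale.
fn merge_scan_result(existing: &PackageEntry, scanned: PackageEntry) -> PackageEntry {
    if existing.state.is_transitional() {
        PackageEntry { state: existing.state, ..scanned }
    } else {
        scanned
    }
}
```

With this shape, a scan that observes a container as still running during a long stop cannot flip the dashboard back to "Running" mid-stop, which is the failure the manual LND test is meant to catch.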

Chaos harness (not in repo — lives on .116)

archipelago@192.168.1.116:~/ui-chaos/ — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
/tmp/chaos/ on laptop — canonical source for rsync to .116.
Run: cd ~/ui-chaos && npx playwright test tests/<spec>
Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
Uses SSH+Playwright hybrid per design; includes the bash -lc '<escaped>' single-quote fix for ssh argv flattening and JSON-parsed podman inspect instead of Go templates.
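
As a quick check on the 32-case arithmetic, the matrix can be enumerated as a cross product. This is only a sketch: the real harness is Playwright-based, and the four container names below are an assumption inferred from the smoke tests listed above (bitcoin-core, LND, ElectrumX, bitcoin-ui). The scenario names are copied from the target list.

```rust
// Assumed container names; the harness may use different identifiers.
static CONTAINERS: [&str; 4] = ["bitcoin-core", "lnd", "electrumx", "bitcoin-ui"];

// Scenario names as listed in the target description.
static SCENARIOS: [&str; 8] = [
    "install-fresh",
    "graceful-stop",
    "sigkill",
    "rm-container",
    "oom-kill",
    "rm-image",
    "restart-service",
    "network-partition",
];

/// Build every (container, scenario) pair, container-major.
fn chaos_matrix() -> Vec<(&'static str, &'static str)> {
    CONTAINERS
        .iter()
        .flat_map(|container| SCENARIOS.iter().map(move |scenario| (*container, *scenario)))
        .collect()
}
```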

Pre-existing bugs still deferred (do not fix until Stop UX lands)

archipelago --version spawns server (should be a pure CLI query)
RPC unknown-method returns generic error (should return method-not-found with the bad method name)
docker_packages.rs filters out UI containers (archy-lnd-ui, archy-electrs-ui) — some views need them visible
lnd.lan_address stale on .228
first-boot silent failure on some hardware
web-ui.failed.* scar on .228 (benign systemd unit state)
test_parse_image_versions pre-existing broken assertion — fix or #[ignore] when touching that area

Where we are

Working through the 11-step plan in rust-orchestrator-migration.md.

Step 1 — 3767c267 ContainerConfig schema with build:, ResolvedSource enum, resolve(), 10 tests
Step 2 — 34af4d9d ContainerRuntime trait gained image_exists + build_image, 4 argv tests, 25/25 pass
Step 3 — b6a04d31 ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
Step 4 — e8a59c93 ContainerOrchestrator trait, RpcHandler uses it in prod (+ 13858842 chore gitignore ._*)
Step 5 — fc39b04b BootReconciler with Arc shutdown, 4 paused-time tests pass
Step 6 — 48f08aa3 main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
Step 7 — 069bc4a5 bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
Step 8a — a0707f4d retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
Step 9 — Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
.228 dashboard bugs — ExtraHost 192.168.1.254 bug (3ee192ba) + LND macaroon permission bug (be960023). See "Post-Step 9 bug hunt" below.
Step 8b — Port remaining ~25 container creations from first-boot-containers.sh into apps/<id>/manifest.yml, then port update.rs to orchestrator (deferred, multi-day work)
Step 8c — Rename first-boot-containers.sh → first-boot-setup.sh, strip container ops, keep setup. Delete reconcile-containers.sh + container-specs.sh. Add ISO lines to copy apps/ (final one-way door, requires 8b complete)
Step 10 — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
Step 11 — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)

Post-Step 9 bug hunt (.228, 2026-04-23)

User reported three visible dashboard bugs after Step 9 verification:

LND — "no connect details or QR"
ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
bitcoin-core — in scope for chaos testing

Root cause #1 (ExtraHost, commit 3ee192ba): scripts/first-boot-containers.sh computed HOST_GATEWAY from ip route show default, which returns the LAN router (e.g. 192.168.1.254), not the gateway to the host. Every container configured with --add-host=host.containers.internal:$HOST_GATEWAY was dialing the WiFi router instead of the host. LND crash-looped with dial tcp 192.168.1.254:8332: connection refused; ElectrumX's DAEMON_URL hit the same dead end; any archy-net bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic host-gateway literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected --add-host; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
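
The shape of the fix can be sketched as follows. The actual change was made in the bash script; this Rust helper, `add_host_arg`, is hypothetical and only illustrates the idea of passing podman's `host-gateway` literal instead of an address computed from the default route.

```rust
/// Build the --add-host argument for a container alias.
/// The value is podman's special `host-gateway` literal, so no IP address
/// is computed on our side and the LAN router can never leak in.
fn add_host_arg(alias: &str) -> String {
    format!("--add-host={}:host-gateway", alias)
}
```

For example, `add_host_arg("host.containers.internal")` yields `--add-host=host.containers.internal:host-gateway`, leaving the address resolution to podman.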

Root cause #2 (macaroon permissions, commit be960023): LND's admin.macaroon lives at /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (getinfo, connect-info, export-channel-backup) plus the shared lnd_client() helper failed with "Failed to read LND admin macaroon". Confirmed pre-existing on .116 too (long-standing bug unrelated to Step 9). Fix: centralised the path as LND_ADMIN_MACAROON_PATH, added a read_lnd_admin_macaroon() helper in api/rpc/lnd/mod.rs that tries direct read first then falls back to sudo -n cat (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — curl -k https://<host>/lnd-connect-info now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
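
A minimal sketch of the direct-read-then-sudo pattern is shown below. It is not the code of `read_lnd_admin_macaroon()`; the function name `read_with_sudo_fallback` is hypothetical, and falling back only on a permission error (rather than on any error) is an assumption made for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Read a file directly; if that fails with a permission error,
/// retry through `sudo -n cat` (non-interactive, never prompts).
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        // Assumption: only a permission error justifies escalating.
        Err(err) if err.kind() == io::ErrorKind::PermissionDenied => {
            let output = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!("sudo -n cat failed for {}", path.display()),
                ))
            }
        }
        Err(err) => Err(err),
    }
}
```

Routing every call site through one helper like this keeps the path and the escalation policy in a single place, which is the point of the centralisation described above.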

Step 9 evidence (.228, 2026-04-23)

Binary: Step 9 build with 732df1b8 + ba83f9bc, scp'd to .228 as /usr/local/bin/archipelago. Old binary backed up at /usr/local/bin/archipelago.bak-pre-step9. Later replaced with macaroon-fix build (be960023); previous backed up at /usr/local/bin/archipelago.bak-pre-macaroon.
DEV_MODE override disabled (override.conf → override.conf.disabled-pre-step9).
/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml populated.
/opt/archipelago/docker/bitcoin-ui/Dockerfile replaced with the Step 7 version (no COPY nginx.conf). Old dir backed up as bitcoin-ui.bak-pre-step9.
Post-start snapshot:
- 🔗 Adopted 1 existing container(s): ["electrs-ui"] — adoption of 13h-running container worked without recreation
- 🔄 Boot reconciler started (interval: 30s) — every 30s, all three app_ids reach NoOp after the initial install pass
- bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18 — pre-start hook fires in install_fresh
- curl localhost:8334 → HTTP 200 (bitcoin-ui), :8081 → 200 (lnd-ui), :50002 → 200 (electrs-ui)
- OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)

Bugs fixed this session

parse_memory_limit truncation bug (732df1b8): lowercased "128Mi" → "128mi" → trim_end_matches('m') → "128i" → f64 parse fails → None.unwrap_or(0) → OCI memory.limit:0 → systemd rejects MemoryMax=0. 6 regression tests; create_container now omits instead of emitting 0.
archipelago.service cgroup delegation missing (ba83f9bc): belt-and-braces Delegate=memory pids cpu io.
ExtraHost 192.168.1.254 (3ee192ba): see Post-Step 9 bug hunt above.
LND admin.macaroon unreadable (be960023): see Post-Step 9 bug hunt above.
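
To make the truncation bug concrete, here is a minimal sketch of a parser that handles the IEC suffixes without the `trim_end_matches('m')` pitfall. It is not the shipped `parse_memory_limit`; it assumes only Ki/Mi/Gi suffixes and bare byte counts are accepted, and that a zero result should be reported as `None` so the caller can omit the limit.

```rust
/// Parse strings such as "128Mi" or "1Gi" into a byte count.
/// Returns None for unparseable input or a zero limit, so the caller
/// can omit the OCI memory limit instead of emitting 0.
fn parse_memory_limit(raw: &str) -> Option<u64> {
    let s = raw.trim().to_ascii_lowercase();
    // Strip the whole two-letter suffix at once; stripping a lone 'm'
    // is what previously turned "128mi" into "128i".
    let (digits, multiplier) = if let Some(n) = s.strip_suffix("ki") {
        (n, 1u64 << 10)
    } else if let Some(n) = s.strip_suffix("mi") {
        (n, 1u64 << 20)
    } else if let Some(n) = s.strip_suffix("gi") {
        (n, 1u64 << 30)
    } else {
        (s.as_str(), 1)
    };
    let value: u64 = digits.trim().parse().ok()?;
    value.checked_mul(multiplier).filter(|bytes| *bytes > 0)
}
```

Under these assumptions `parse_memory_limit("128Mi")` gives `Some(134217728)`, while `"128i"` and `"0"` both give `None`.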

Commits made this session

3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer}  (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)

Branch is 19 commits ahead of tx1138/main (local only — user pushes to mirrors personally).

Uncommitted state

Clean. Only untracked: tests/ (bats harness from prior session, not in scope), tmp-dump-spec.py (scratch).

Answered design questions (no need to re-ask)

UI container naming → archy-<app_id> for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
Reconciler interval → 30 seconds
Concurrency → per-app Mutex<()> in a DashMap
Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
Step 4 extension → ContainerOrchestrator trait includes install(app_id); the manifest_path-based install RPC stays dev-only
Step 7 bitcoin-ui template → embed via include_str!, render on install + every reconcile, atomic tmp+rename to /var/lib/archipelago/bitcoin-ui/nginx.conf, bind-mount into container. RPC user hardcoded archipelago, password from /var/lib/archipelago/secrets/bitcoin-rpc-password.
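
The atomic tmp+rename write used for the rendered nginx.conf can be sketched like this. The helper name `write_atomically` and the `.tmp` suffix are assumptions for illustration; the real hook may name its temporary file differently.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to a sibling temp file, flush it to disk, then rename
/// it over `target`. A rename within one filesystem replaces the target
/// in a single step, so readers never see a half-written file.
fn write_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Assumed naming scheme: "<target>.tmp" in the same directory.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut file = fs::File::create(&tmp_path)?;
    file.write_all(contents)?;
    file.sync_all()?;
    fs::rename(&tmp_path, target)
}
```

Keeping the temp file next to the target matters: a rename across filesystems is not atomic, so writing to /tmp and moving the file into /var/lib/archipelago would lose the guarantee.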

Context: which host is what

Host	IP	Role	Dashboard pw	Sudo pw
`archy`	192.168.1.116	Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target.	archipelago	ThisIsWeb54321@
`archy228`	192.168.1.228	Kiosk HP ProDesk. Step 9 landing zone — now running Rust-orchestrator binary in prod mode.	password123	archipelago

Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.

Next action

Step 10 — Hot-swap on .116.

Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.

Steps:

Disable DEV_MODE on .116 (check if override.conf exists — /etc/systemd/system/archipelago.service.d/)
Stage the already-built binary at ~/Projects/archy/core/target/release/archipelago → /usr/local/bin/archipelago.new
Ensure /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml present (copy from repo)
Ensure /opt/archipelago/docker/bitcoin-ui/ matches the Step-7 layout (no baked nginx.conf)
Snapshot: podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}" → save to /tmp/pre-step10-containers.txt
systemctl stop archipelago → install binary → systemctl start archipelago
Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
If broken → restore .bak binary, re-enable DEV_MODE override.
Commit STATUS.md update.

Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.

After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).

Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

first-boot-containers.sh creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
update.rs (OTA update RPC) invokes reconcile-containers.sh at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.

Current state

Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

Mirror	Host	Status
tx1138	https://git.tx1138.com	✅ v1.7.40-alpha live
gitea-local	http://localhost:3000	✅ v1.7.40-alpha live
.160	http://23.182.128.160:3000	✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below)
.168	http://146.59.87.168:3000	✅ v1.7.40-alpha live

Fleet test nodes:

Node	Version	State
.103 (dev)	1.7.40	running, being developed against
.116 (this box)	1.7.40	healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug
.198	1.7.39 → 1.7.40-alpha	healed manually
.228 (primary test)	1.7.40-alpha	healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live
.249 (ISO test)	—	unreachable today
.253	1.7.39 → 1.7.40-alpha	healed manually

Known open issues (drives the plan below)

1. UI companion containers disappear on .228 after daemon restarts; there is no auto-recreate (planned fix: v1.7.45 Quadlet migration)
2. bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (planned fix: v1.7.43 reconcile::derived)
3. host.containers.internal resolves to LAN gateway inside containers on some versions (planned fix: v1.7.42 containers.conf)
4. Podman state DB loss requires manual recovery (planned fix: v1.7.44 startup self-heal)
5. LND "Connect Wallet" info vanishes after crashes; this is a symptom of the same drift class as #2
6. ElectrumX not syncing on .228; this is downstream of #2 and will resolve when bitcoin.conf is reconciled

Recent field incident (2026-04-22)

Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was drwx------ (700). Every node that OTA'd got 500 errors on every page.
Root-cause fix shipped in v1.7.40 (create-release-manifest.sh chmod + pre-ship assertion that tar tvzf | head -1 shows drwxr-xr-x).
.160 Gitea was down all day (502) because its rootless podman's libpod/bolt_state.db had vanished. Recovered via clearing /run/user/$UID/{containers,libpod,podman} + podman system renumber.
Full failure-mode audit is in bulletproof-containers.md.

Plan

We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.

Release roadmap

Release	Closes	What lands	Status
v1.7.41	FM5 (bad OTA nginx 500)	Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts	in flight — deploying to .228 for test
v1.7.42	FM4 (`host.containers.internal` wrong)	`/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1`	pending
v1.7.43	FM2 (config drift)	`reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf`	pending
v1.7.44	FM6 (podman state loss)	Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber`	pending
v1.7.45	FM1 + FM3 (companion orphans)	`archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it	pending
v1.7.46	—	`archy-lnd-ui` → Quadlet	pending
v1.7.47	—	`archy-electrs-ui` → Quadlet	pending
v1.7.48+	all (full daemon refactor)	`core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too	pending

The test harness (bats + Goss + Chaos Toolkit + vmtest) lands as a scaffold in v1.7.41. The first lifecycle tests block v1.7.45, and the full matrix blocks the beta tag.

Release history

v1.7.41-alpha — IN FLIGHT — 2026-04-22

Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:

core/archipelago/src/update.rs: PendingVerification struct, write marker before service restart, verify_pending_update() on new binary boot — probes https://127.0.0.1/, on fail restores web-ui.bak + calls rollback_update() + systemctl restart archipelago
core/archipelago/src/main.rs: startup task invokes verifier concurrently with server
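
The verify-or-rollback decision reduces to polling a health probe until a deadline. The sketch below shows only that loop; `probe_until_healthy` is a hypothetical name, the probe is passed in as a closure rather than performing a real HTTPS request, and the rollback itself is left to the caller.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Call `probe` repeatedly until it reports healthy or `deadline` elapses.
/// Returns true on the first healthy probe, false if the deadline passes.
fn probe_until_healthy<F: FnMut() -> bool>(
    mut probe: F,
    deadline: Duration,
    interval: Duration,
) -> bool {
    let start = Instant::now();
    loop {
        if probe() {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

In the release described here the deadline would be 90 seconds and the probe a request to https://127.0.0.1/; a `false` return is the signal to restore web-ui.bak and roll back.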

v1.7.40-alpha — 2026-04-22

Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.

Changes:

scripts/create-release-manifest.sh: pre-tar chmod + tar --mode flag + post-tar verify
Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
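
The pre-ship assertion amounts to checking the mode string on the first line of the tarball listing. The real check lives in the shell script; the following Rust function, `root_dir_is_world_readable`, is a hypothetical restatement of the same rule that takes the output of `tar tvzf` as a string.

```rust
/// Return true when the first entry of a `tar tvzf` listing is a
/// directory with mode drwxr-xr-x (755), which nginx needs in order to
/// traverse the web root. An empty listing counts as a failure.
fn root_dir_is_world_readable(tar_listing: &str) -> bool {
    tar_listing
        .lines()
        .next()
        .map_or(false, |first| first.starts_with("drwxr-xr-x"))
}
```

A listing whose first line starts with `drwx------`, as in the v1.7.38/39 tarballs, fails the check and should abort the release.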

v1.7.39-alpha — 2026-04-22

Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.

v1.7.38-alpha — 2026-04-22

Onboarding auto-heal + silent logins + App Store trim.

Changes:

auth.rs: is_onboarding_complete() auto-heals from setup_complete + password_hash (prevents clear-cache → onboarding wizard bug)
useOnboarding: tri-state — backend-unreachable no longer defaults to /onboarding/intro
Login sounds gated by isFirstInstallPhase() — silent after onboarding, typing sounds unaffected
Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
Deleted 15 image versions from tx1138, .168, gitea-local registries
AIUI baked into release tarball via demo/aiui/
prebuild hook syncs app-catalog/catalog.json → public/catalog.json

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

v1.7.37-alpha — 2026-04-22

Bitcoin Core install fixes + dynamic node UI + full-archive default.

Bitcoin Core passes explicit -rpcbind/-rpcallowip/etc. CLI args so vanilla image exposes RPC
Split bitcoin-core from bitcoin-knots in backend AppMetadata
bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
Storage (Full Archive · X GB / Pruned) indicator on dashboard
Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
Pull fallback to docker.io when no mirror carries the image
Removed prune=550 hardcode — full archive default

Key docs

bulletproof-containers.md — full reconcile architecture, code layout, test matrix, chaos scenarios, sources
BETA-RELEASE-CHECKLIST.md — existing beta checklist
BETA-ISSUES-20260328.md — prior beta-blocker tracking
hotfix-process.md — release workflow
architecture.md — system architecture overview

How to resume

Check fleet mirrors are all live: curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version
Read bulletproof-containers.md for the current plan
Check task list (/list or via Claude Code) for the in-flight release
Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified
