RESUME HERE — Rust orchestrator migration
Updated: 2026-04-23 (Install UX polish: phase-based progress bar, post-install scanner kick for instant Launch button, .23 VPS retired with auto-purge migration, frontend/backend deployed to .228 as v1.7.43-alpha.)
To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.
✅ INSTALL UX POLISH + .23 RETIREMENT — SHIPPED (v1.7.43-alpha)
Rounds 3–5 + config migration + changelog (2026-04-23) — 5 commits on main (unpushed per user mirror protocol):
- `8cc84ebc` `feat(install): phase-based progress bar replaces unparseable pull bytes`: `podman pull` emits zero parseable progress when stderr is piped (no TTY), so the legacy byte-counting regex never matched. Replaced with 7 phase-based levels: Preparing (5%) → PullingImage (20%) → CreatingContainer (70%) → StartingContainer (80%) → WaitingHealthy (88%) → PostInstall (95%) → Done (100%). UI maps phases to fixed % and only advances forward (`Math.max`). Final phase label renamed from "Running post-install…" to "Finalizing…" after user feedback that it read like a regression to the install step.
- `f86d86c3` `fix(install): kick scanner post-install so Launch button appears immediately`: scan runs every 60s; post-install the state flipped to Running but the skeletal install-time manifest (`interfaces: None`) persisted until the next scan, so `canLaunch(pkg)` returned false for up to a minute. Added `scan_kick: Arc<Notify>` + `scan_tick: Arc<watch::Sender<u64>>` on `RpcHandler`. Scan loop uses `tokio::select!` between the 60s interval and the notify. New `kick_scanner_and_wait` helper (2s timeout) called in install/update success paths BEFORE writing Running, so a fresh manifest lands first. Merge during Installing/Updating uses `merge_preserving_transitional` (keeps state, takes fresh manifest).
- `22052325` `chore: retire .23 VPS mirror, promote .168 OVH to primary`: dropped `DEFAULT_TERTIARY_MIRROR_URL`, promoted `.168` to `DEFAULT_SECONDARY_MIRROR_URL` as "Server 1 (OVH)". 2-entry default registry (.168 priority 0, tx1138 priority 10). Trusted-registry allowlist, catalog fallback, installer ISO registries, `marketplaceData.ts` `REGISTRY`, `image-versions.sh` all updated. Tests updated for new default counts (registry 3→2, mirror 3→2). URL-parser fixture tests in `update.rs` retain `.23` strings intentionally: they exercise string-parsing logic, not policy.
- `0ee16820` `fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs`: `load_mirrors` / `load_registries` normally only ADD missing defaults (explicit removals stick, by design). Existing nodes have `.23` baked into their saved `update-mirrors.json` + `config/registries.json` and would pay timeouts forever against a dead host. Added targeted one-time migration in both loaders: `.retain(|m| !m.url.contains("23.182.128.160"))` before the defaults-merge step. Narrow-scope exception to the stickiness rule, documented in-code. Triggers lazily on next load (install RPC, update RPC, Settings UI open).
- `008da477` `docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement`: 4 release-note bullets in `AccountInfoSection.vue` describing async-spawn, phase progress, scanner kick, and .23 retirement from the operator's perspective. Historical "Server 3 (OVH)" entries in older changelog blocks left intact, since they describe what shipped at the time.
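The phase-to-percentage mapping from `8cc84ebc` can be sketched as follows. This is a minimal illustration, not the shipped code: the percentages come from the commit notes above, while the `Phase` and `ProgressBar` names are hypothetical, and the forward-only rule that the UI implements with `Math.max` is modelled here in Rust purely for illustration.

```rust
/// Install phases in the order listed in the commit notes.
/// The type name `Phase` is an assumption made for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Preparing,
    PullingImage,
    CreatingContainer,
    StartingContainer,
    WaitingHealthy,
    PostInstall,
    Done,
}

impl Phase {
    /// Fixed percentage displayed for each phase.
    pub fn percent(self) -> u8 {
        match self {
            Phase::Preparing => 5,
            Phase::PullingImage => 20,
            Phase::CreatingContainer => 70,
            Phase::StartingContainer => 80,
            Phase::WaitingHealthy => 88,
            Phase::PostInstall => 95,
            Phase::Done => 100,
        }
    }
}

/// Hypothetical tracker: the displayed value never moves backwards,
/// so a late or out-of-order phase report cannot shrink the bar.
#[derive(Default)]
pub struct ProgressBar {
    shown: u8,
}

impl ProgressBar {
    /// Record a phase and return the percentage to display.
    pub fn advance(&mut self, phase: Phase) -> u8 {
        self.shown = self.shown.max(phase.percent());
        self.shown
    }
}
```

Because each phase maps to a constant, the bar needs no output from `podman pull` at all, which is the point of the change.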
Deployed to .228:
- Backend binary md5 `d2b619949f19815faaeab10429e36ba0` at `/usr/local/bin/archipelago`.
- Frontend at `/opt/archipelago/web-ui/` (includes marketplaceData.ts .168 update + v1.7.43-alpha changelog entry). Deployed bundle verified: `.168` present in `Settings-*.js` + `Marketplace-*.js`, `.23` absent from all assets.
- `/var/lib/archipelago/update-mirrors.json` + `config/registries.json` were manually deleted + regenerated with new defaults during Round 5 verification; migration code will handle any other node on first load.
- Rollback targets from Round 2 still valid: `/usr/local/bin/archipelago.bak-pre-async-install` + `/opt/archipelago/web-ui.bak-pre-async-install/`.
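For nodes that were not regenerated by hand, the one-time purge from `0ee16820` runs before the defaults merge. A rough sketch of that ordering, assuming a simplified `Mirror` struct and a hypothetical `purge_then_merge` function (the real loaders are `load_mirrors` / `load_registries`, whose signatures are not shown in this document):

```rust
/// Simplified stand-in for a saved mirror/registry entry (assumed shape).
#[derive(Clone, Debug, PartialEq)]
pub struct Mirror {
    pub url: String,
    pub priority: u32,
}

/// Address of the decommissioned .23 VPS, as quoted in the commit notes.
const RETIRED_HOST: &str = "23.182.128.160";

/// Hypothetical helper: drop entries that point at the retired host,
/// then append any default that is still missing. User-added entries
/// and explicit removals of other defaults are left alone.
pub fn purge_then_merge(mut saved: Vec<Mirror>, defaults: &[Mirror]) -> Vec<Mirror> {
    // Step 1: targeted one-time migration.
    saved.retain(|m| !m.url.contains(RETIRED_HOST));
    // Step 2: the usual "only ADD missing defaults" merge.
    for d in defaults {
        if !saved.iter().any(|m| m.url == d.url) {
            saved.push(d.clone());
        }
    }
    saved
}
```

Running the purge first matters: if the merge ran first, a stale `.23` entry would survive alongside the new defaults and keep costing timeouts.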
Git remotes cleaned on .116 (working-copy change only, not in any commit):
- `git remote remove gitea-vps` (dropped the .23 Gitea remote).
- `git remote set-url --delete --push origin http://.../23.182.128.160:3000/...` (dropped .23 from origin multi-push alias).
- Remaining push targets: `tx1138` (canonical), `gitea-local` (localhost Gitea), `gitea-vps2` (.168 OVH).
Rollback Rounds 3–5 (same command as Round 2 — backups predate all of this):
ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
✅ ASYNC-SPAWN LIFECYCLE FIX — SHIPPED (Stop/Start/Restart + Install/Uninstall/Update)
Round 2 (2026-04-23, install/uninstall/update) — 3 commits on main:
- `2d5b859e` `feat(rpc): async-spawn install/uninstall/update lifecycle`: new `api/rpc/package/async_lifecycle.rs` with `spawn_package_install`, `spawn_package_uninstall`, `spawn_package_update`. Dispatcher + handler thread `self: Arc<Self>` so spawned tasks own their Arc. Install/update Ok arms explicitly set `Running` because `merge_preserving_transitional` refuses to let the scanner overwrite `Installing`/`Updating`. Removed redundant inner "already updating" guard in `update.rs`. Transient install entry uses empty icon (see commit 3 rationale).
- `0733ac40` `fix(ui): shorten install/uninstall/update timeouts for async RPCs`: drop 11m/45m timeouts to 15s across `rpc-client.ts`, `stores/server.ts`, and the 5 direct call sites in `Marketplace.vue`, `Discover.vue`, `MarketplaceAppDetails.vue`. Return types updated to `{ status, package_id }`.
- `e471ef75` `fix(rpc): empty icon in transient install entry to avoid broken-image flicker`: `progress.rs::create_installing_entry` no longer hardcodes `/assets/img/app-icons/<id>.png`. About half of bundled apps use `.svg`/`.webp` icons; the frontend's fallback chain (`backend_icon || curated.icon || placeholder`) now lands on the correct curated extension.
Deployed to .228 (binary md5 f66857b3b8b3640c8cac8bd25fe508ec at /usr/local/bin/archipelago, backup at /usr/local/bin/archipelago.bak-pre-async-install; frontend at /opt/archipelago/web-ui/, backup at /opt/archipelago/web-ui.bak-pre-async-install/). User confirmed: uninstall fast and responsive, install of LND + SearXNG clean, icon flicker fixed.
Known out-of-scope issue: Vaultwarden container itself exits immediately on start with an internal error. The async wrapper correctly detects this via post-start exit verification and removes the state entry. Needs separate vaultwarden container-config investigation.
Rollback Round 2 (if ever needed):
ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
Round 1 (Stop/Start/Restart) — 4 commits on main (unpushed per user mirror protocol):
- `44cd5eef` `feat(rpc): spawn_transitional helper for async lifecycle ops`: new `api/rpc/transitional.rs` with `Op::{Stop,Start,Restart}` and `RpcHandler::spawn_transitional` / `flip_to_transitional` / `set_state` helpers. `install_log` re-exported so sibling modules can use it.
- `19a99ca9` `fix(rpc): async container stop/start/restart; widen state mapping`: `container.rs` start/stop rewritten + restart added; `container-list` now emits all transitional variants instead of falling back to `"unknown"`. `dispatcher.rs` registers `container-restart`. `package/runtime.rs` mirrored with `do_package_*` helpers inside `tokio::spawn` and revert-on-error.
- `6712810b` `fix(state): preserve transitional state across container scans`: `server.rs` scan merge now keeps transitional states while taking fresh observability fields; 1200s stuck-timeout escape hatch via `transitional_since: HashMap<String, Instant>`. Three passing `server::merge_tests`.
- `9ce28f08` `fix(ui): single-button lifecycle control with transitional labels`: `ContainerApps.vue` and `ContainerAppDetails.vue` use a single primary button driven by `getAppVisualState()`. Dashboard now routes through `container-start` / `container-stop` (the async RPCs) instead of the legacy synchronous `bundled-app-*` path. `ContainerStatus.vue` widened to render all new variants.
Deployed to .228 (ThinkPad demo device):
- Binary at `/usr/local/bin/archipelago` (md5 `de86b63f74c7e6fe6e555ffe30b86b4f`), backup at `/usr/local/bin/archipelago.bak-pre-async-stop`.
- Frontend at `/opt/archipelago/web-ui/`, backup at `/opt/archipelago/web-ui.bak-pre-async-stop/`.
- Release build took 3m56s on .116. Deploy via scp + atomic `install -m 755` + `systemctl restart archipelago`. `nginx -t` + `systemctl reload nginx` for frontend.
Manual verification: user clicked Stop on LND in the dashboard. Button flipped to Stopping… instantly, held for the full graceful-stop window, transitioned to Start when podman stop completed. No mid-flight revert to Running. User sign-off: "absolutely beautiful".
Rollback (if ever needed):
ssh archy228 'sudo cp /usr/local/bin/archipelago.bak-pre-async-stop /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-stop/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
Follow-ups to consider
- Chaos matrix / Step 11 — the original next-step gated behind this fix. Now unblocked.
- bundled-app-start / bundled-app-stop — still synchronous in the backend. Dashboard no longer calls them, but the RPC methods remain for any external caller. Decide: deprecate, or mirror the async-spawn treatment for parity.
- `transitional_since` persistence: currently in-memory only, so a backend restart mid-stop loses the timeout anchor. Acceptable for now (scan loop re-observes live podman state and reconciles), but worth revisiting if crash-recovery stories tighten.
- Test regressions inventory: the full `cargo test -p archipelago` run on .116 shows 22 pre-existing failures in unrelated modules (mesh/wallet/credentials/avatar/session/transport/update-mirrors/fips/identity_manager/image_versions). Unrelated to this work but tech debt. Log at `/tmp/cargo-test-all.log` on .116.
- Amend STATUS.md's older "NEXT SESSION — START HERE" section (below): it is now stale. Left in place for historical reference of how the fix was designed; delete on the next pass if it gets confusing.
⚡ NEXT SESSION — START HERE (historical — fix above is now shipped)
Goal: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: "best server containers in the world". Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live Stopping… label.
How to work on this repo (SSH + SSHFS setup)
You are likely running on the laptop (macOS). The repo lives on the ThinkPad (.116). There are two access paths, use both in parallel:
- SSHFS mount at `~/mnt/archy-thinkpad/`: for all file ops (read/edit/write/glob/grep).
- Direct SSH: for everything that isn't file ops, i.e. `git`, `cargo`, `npm`, `systemctl`, running the server, tailing logs.
See the "FUSE / SSHFS development loop" section below for the full mount lifecycle — that's the thing that makes this dev setup work, and it will break periodically.
FUSE / SSHFS development loop
Why this exists: editing the repo directly on the ThinkPad over raw SSH means no IDE, no tool-native file reads, no glob/grep speed. SSHFS mounts the remote filesystem as a local directory so OpenCode's file tools work transparently. But SSHFS is a leaky abstraction — know the gotchas or you'll waste hours.
Stack (macOS laptop):
- macFUSE: kernel extension providing FUSE on macOS. Install via `brew install --cask macfuse` (requires reboot + security approval in System Settings the first time).
- sshfs: userspace mount tool. Install via `brew install gromgit/fuse/sshfs-mac` (the homebrew core `sshfs` was removed; use this tap).
- Verify: `which sshfs` → `/opt/homebrew/bin/sshfs`, `sshfs --version` → `SSHFS version 2.10 / FUSE library version 2.9.9`.
Actual mount command currently running (verified from ps):
sshfs archy:Projects/archy /Users/dorian/mnt/archy-thinkpad \
-o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
Breakdown:
- `archy:Projects/archy`: remote path via the `archy` SSH alias (uses `~/.ssh/archy_opencode`, no password prompt).
- `~/mnt/archy-thinkpad`: local mount point. Create once: `mkdir -p ~/mnt/archy-thinkpad`.
- `reconnect`: sshfs auto-reconnects if the TCP session drops (WiFi flap, laptop sleep). Without this, the mount turns into a zombie immediately.
- `ServerAliveInterval=15`: sends a keepalive every 15s.
- `ServerAliveCountMax=3`: disconnect after 3 missed keepalives (45s). Tune up if your network is flaky.
- `volname=archy-thinkpad`: Finder display name.
Check mount health:
mount | grep archy-thinkpad
# should print: archy:Projects/archy on /Users/dorian/mnt/archy-thinkpad (macfuse, nodev, nosuid, synchronous, mounted by dorian)
ls ~/mnt/archy-thinkpad/ | head
# should list repo contents fast (<1s). If it hangs, mount is stale.
Recovery when the mount hangs / goes stale (this WILL happen — laptop sleeps, WiFi drops, ThinkPad reboots):
# 1. Force-unmount (macOS — `umount` alone often fails on a hung FUSE mount)
sudo diskutil unmount force ~/mnt/archy-thinkpad
# fallback if diskutil can't see it:
sudo umount -f ~/mnt/archy-thinkpad
# 2. Kill any zombie sshfs process
pkill -f "sshfs archy:Projects/archy"
# 3. Remount
sshfs archy:Projects/archy ~/mnt/archy-thinkpad \
-o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
# 4. Verify
ls ~/mnt/archy-thinkpad/ | head
If the mount point itself got wedged (ls: /Users/dorian/mnt/archy-thinkpad: Device not configured), the sequence above still works — macFUSE garbage-collects the inode after the force-unmount.
When to use which path (rules, not suggestions):
| Operation | Use | Why |
|---|---|---|
| `read` / `edit` / `write` | SSHFS mount | OpenCode tools want local paths |
| `glob` / `grep` | SSHFS mount | Local FS traversal is fine; remote would need rg over SSH |
| Reading many files | SSHFS mount | Each read is a round-trip but parallelizable |
| `git status` / `git diff` / `git log` | SSH | Git over FUSE is painfully slow (lots of stat calls) |
| `git add` / `git commit` | SSH | Same: commit times grow linearly with tree size on FUSE |
| `cargo check` / `cargo test` / `cargo build` | SSH | Compiling over FUSE would take hours; cargo's incremental stat pattern destroys FUSE performance |
| `npm install` / `npm run build` | SSH | Same reason: massive file churn |
| Running the server / tailing journal | SSH | Service lives on .116 |
| Deploying to .228 | SSH from .116 | SCP from ThinkPad; laptop isn't in the critical path |
Don't do this (will bite you):
- `cargo build` from the mount: will try to write target/ over FUSE, gets orders of magnitude slower, may hang.
- `rsync` without `--exclude="._*"`: macOS writes AppleDouble metadata files, and they leak to the remote as `._*` siblings of every real file. `.gitignore` already excludes them (commit `13858842`), but they clutter the tree.
- Writing big binary files via the mount: use `scp` over SSH instead.
- Relying on file-change-watcher tools (watchman, chokidar): they get confused by FUSE event semantics.
Editing workflow in a typical session:
- Laptop: OpenCode reads a file via `/Users/dorian/mnt/archy-thinkpad/...`. FUSE fetches it over SSH, caches briefly.
- Laptop: OpenCode edits the file. FUSE writes the new bytes back to .116 immediately (synchronous mount).
- Laptop: `ssh archy "cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago"` runs on the real filesystem on .116 and sees the edit.
- Laptop: `ssh archy "cd ~/Projects/archy && git diff path/to/file"` confirms the edit landed.
- Laptop: `ssh archy "cd ~/Projects/archy && git add path/to/file && git commit -m '...'"` commits from .116.
The SSHFS mount and the SSH shell are pointing at the same inodes — edits via the mount are instantly visible to cargo/git over SSH. There's no "sync" step.
Cache caveat: macFUSE caches attributes briefly (default ~1s). If you write via SSH and read via the mount within that window, you may see stale metadata. The mount's synchronous flag (visible in mount output) minimizes but doesn't eliminate this. If you get a weird diff between what SSH and the mount report, re-read after a second, or stat --file-system ~/mnt/archy-thinkpad/<file> to force a refresh.
Direct SSH access (use when FUSE isn't the right tool):
- `ssh archy` → `archipelago@192.168.1.116` using `~/.ssh/archy_opencode`
- `ssh archy228` → `archipelago@192.168.1.228` using `~/.ssh/archy_opencode`
- Full host form also works: `ssh archipelago@192.168.1.116` / `ssh archipelago@192.168.1.228` (same key resolves via IdentitiesOnly).
SSH keys — what's where
Laptop ~/.ssh/ (macOS, user dorian):
| File | Purpose |
|---|---|
| `archy_opencode` / `.pub` | Primary key for this project. Unlocks both archy (.116) and archy228 (.228). Created 2026-04-22 specifically for OpenCode work. |
| `archipelago-deploy` / `.pub` | Older archipelago deploy key. Not needed for current work. |
| `id_ed25519` / `.pub` | Personal default key. Not used by archy/archy228 configs (IdentitiesOnly yes forces archy_opencode). |
| `id_ed25519_angor` / `.pub` | Angor project. Unrelated. |
| `id_ed25519_start9` / `.pub` | Start9 project. Unrelated. |
| `vps-ci-setup` / `.pub` | VPS CI. Unrelated. |
| `config` | Host aliases (shown above) |
.116 /home/archipelago/.ssh/:
| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's archy_opencode.pub + 3 other keys (4 lines total). |
| `id_ed25519` / `.pub` | .116's OWN identity key. This is what lets .116 → .228 work passwordless. |
| `archipelago-deploy` | Symlink → id_ed25519 (legacy alias). |
| `id_ed25519_vps168` / `.pub` | For SSH to 146.59.87.168 (VPS). Unrelated to this work. |
| `config` | Host entry for the VPS only. |
.228 /home/archipelago/.ssh/:
| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's archy_opencode.pub + .116's id_ed25519.pub + 2 others (4 lines total). |
| (no `id_ed25519`) | .228 has no outbound key; it's a terminal node. Don't try to ssh from .228 to anywhere. |
Connectivity matrix (all verified 2026-04-23):
| From → To | Works passwordless | Via |
|---|---|---|
| Laptop → .116 | ✅ | archy_opencode |
| Laptop → .228 | ✅ | archy_opencode |
| .116 → .228 | ✅ | .116's id_ed25519 |
| .228 → anywhere | ❌ | no outbound key (by design) |
Sudo — verified state
.116 (dev ThinkPad):
- User `archipelago` is in `sudo` group.
- Sudo password required: `ThisIsWeb54321@`
- Sudoers drop-ins present: `/etc/sudoers.d/archipelago-ci`, `/etc/sudoers.d/archipelago-wg` (scope-limited NOPASSWD for specific CI/wg commands, not full NOPASSWD).
- For most dev work you don't need sudo on .116.
.228 (prod kiosk):
- User `archipelago` has full passwordless sudo via `/etc/sudoers.d/archipelago` containing `archipelago ALL=(ALL) NOPASSWD:ALL`.
- User is also in `sudo` group.
- Sudo password (if ever prompted, shouldn't be): `archipelago`
- Dashboard password: `password123`
Cargo / npm / paths
- Cargo PATH gotcha: non-interactive SSH login has no cargo in PATH. Always use `~/.cargo/bin/cargo` over SSH, and `cd` inside the remote command (`ssh` itself has no `--workdir` option).
  - Example from the workspace dir: `ssh archy 'cd ~/Projects/archy/core && ~/.cargo/bin/cargo check -p archipelago'`
  - Or from the repo root: `ssh archy 'cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago'`
- Long cargo builds (>2 min Bash tool timeout): launch detached and poll the log:

      ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
      ssh archy 'tail -30 /tmp/cargo-build.log'
      ssh archy 'pgrep -a cargo'   # to check if still running

- npm / frontend lives at `~/Projects/archy/neode-ui/` on .116 (also accessible via laptop mount at `~/mnt/archy-thinkpad/neode-ui/`). Node is on interactive PATH; for scripted SSH, `source ~/.nvm/nvm.sh && nvm use` or call the absolute path if nvm is used.
- Repo on .116: `~/Projects/archy/` (Cargo workspace at `core/Cargo.toml`).
- Web root on .228: check `/etc/nginx/sites-enabled/` for the live path; historically `/var/lib/archipelago/web-ui/` or `/opt/archipelago/web-ui/`.
Deploying new server binary to .228
# 1. Build on .116 (detached — takes ~3-5 min for release)
ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
# wait / tail log until "Finished `release` profile"
# 2. SCP .116 → .228 (uses .116's id_ed25519 → .228's authorized_keys, passwordless)
ssh archy 'scp ~/Projects/archy/core/target/release/archipelago archipelago@192.168.1.228:/tmp/archipelago.new'
# 3. Atomic swap on .228 with backup
ssh archy228 'sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-pre-async-stop && sudo mv /tmp/archipelago.new /usr/local/bin/archipelago && sudo chmod +x /usr/local/bin/archipelago && sudo systemctl restart archipelago'
# 4. Verify
ssh archy228 'systemctl status archipelago --no-pager | head -20 && sudo journalctl -u archipelago -n 50 --no-pager'
Git workflow
- Branch: `main` on .116, currently 22 commits ahead of `tx1138/main`.
- Remote `tx1138` exists but do NOT push: user mirrors to 4 Gitea remotes personally after reviewing.
- Atomic commits, one logical change per commit. Conventional Commits format (`feat:`, `fix:`, `docs:`, `refactor:`, `chore:`, `test:`, `perf:`).
- Never `--amend` unless the commit you're amending was created in this session AND has not been pushed. Safer: new commit.
- Never `--force` push. Never modify git config.
- If pre-commit hooks fail, create a NEW commit with the fix; don't `--amend` after a failed commit.
Other
- Full destructive latitude on both nodes. Announce multi-hour ops (OTA, full rebuild, apt upgrade). Don't ask for routine stop/start/rebuild permission.
- No ship pressure. Do it properly.
- Use `question` tool for ambiguous decisions (don't guess user intent on design choices).
- Keep `docs/STATUS.md` fresh between sessions: it IS the session handoff.
Hosts reference (quick)
| Host | IP | SSH alias | Role | Dashboard | Sudo |
|---|---|---|---|---|---|
| `archy` (ThinkPad X250) | 192.168.1.116 | `ssh archy` | dev host, Debian 13 | `archipelago` | `ThisIsWeb54321@` |
| `archy228` (HP ProDesk) | 192.168.1.228 | `ssh archy228` | prod kiosk, Rust orchestrator | `password123` | NOPASSWD (fallback `archipelago`) |
Bug being fixed
Dashboard sequence when user clicks Stop LND:
- UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via `loadingApps.add('lnd')`.
- Frontend calls `container-stop` RPC. Server runs `podman stop -t 330 lnd` synchronously inside the RPC handler (via `orchestrator.stop()`). RPC blocks up to 5.5 min for LND (330s timeout + overhead).
- Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect (podman still reports `running` because the stop hasn't completed) and blindly overwrites the store entry at `server.rs:854`.
- `container-list` RPC reads `state_manager` snapshot → returns `state = "running"`.
- Frontend polling sees `running` → `getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → UI looks like the stop silently failed.
- Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change again.
Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".
Decisions already locked in (do not re-ask)
- Full scope fix (not minimal hotfix). User chose "Go full scope, do it right".
- Async-spawn lives in the RPC layer, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
- `PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing` variants (enum at `core/archipelago/src/data_model.rs:107-124`). No schema change needed.
- UI collapses to one full-width button with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
- Helper API shape: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
- `mark_user_stopped` must run BEFORE the spawn (preserves ordering the crash recovery layer depends on; see `runtime.rs:145-148`).
Implementation order (4 commits, local only)
Commit 1 — feat(rpc): spawn_transitional helper for async lifecycle ops
- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
- `impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }`
  - Capture `Arc<dyn ContainerOrchestrator>` + `Arc<StateManager>` clones
  - Set transitional state via `state_manager.update_data()` (if entry exists; skip if not, since Start on never-installed shouldn't create an entry)
  - `tokio::spawn(async move { ... })`
  - Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
  - Return `Ok(())` immediately after spawn
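A synchronous sketch of the state bookkeeping behind `Op` may help when reading the plan above. It deliberately leaves out `tokio::spawn`, the orchestrator, and `install_log`, so it compiles with the standard library alone. The final states chosen for each op, the log prefix strings, and the `settle` helper are assumptions made for illustration; only the method names come from the plan.

```rust
/// Subset of package states relevant to Stop/Start/Restart (assumed subset).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Stopping,
    Starting,
    Restarting,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op {
    Stop,
    Start,
    Restart,
}

impl Op {
    /// State written before the background task is spawned.
    pub fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }

    /// State written when the operation succeeds (assumed mapping).
    pub fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }

    /// Prefix for log lines (placeholder strings, not the real ones).
    pub fn log_prefix(self) -> &'static str {
        match self {
            Op::Stop => "STOP",
            Op::Start => "START",
            Op::Restart => "RESTART",
        }
    }
}

/// Hypothetical stand-in for the tail of the spawned task: pick the
/// final state on success, or revert to the cached previous state.
pub fn settle(op: Op, previous: PackageState, result: Result<(), String>) -> PackageState {
    match result {
        Ok(()) => op.final_state_on_success(),
        Err(_) => previous,
    }
}
```

Caching `previous` before flipping to the transitional state is what makes the revert path possible without asking podman again.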
Commit 2 — fix(rpc): async container stop/start/restart; widen state mapping
- `api/rpc/container.rs:85-107`: rewrite `handle_container_stop` body: `validate_app_id`, `mark_user_stopped`, `spawn_transitional(Op::Stop, app_id.to_string()).await?`, return `Ok(json!({ "status": "stopping" }))`
- `api/rpc/container.rs:61-83`: rewrite `handle_container_start`: `clear_user_stopped`, `spawn_transitional(Op::Start, …)`, return `{ "status": "starting" }`
- Add `handle_container_restart` (currently missing in `container.rs`; only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
- `api/rpc/container.rs:148-154`: widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) inside the spawned future, not in the RPC body.
Commit 3 — fix(state): preserve transitional state across container scans
- `server.rs:847-857`: in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
- Still allow non-state fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
- Unit test: construct `existing.state = Stopping`, `fresh.state = Running`, assert merged.state stays `Stopping`.
- Also check: Is there a timeout escape hatch? If `Stopping` is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck `Stopping` forever. Mitigation: track a `transitional_since: Instant` in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up; lean toward including it, because fleet reliability matters.
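The merge rule and the escape hatch can be sketched together. This is a simplified model, not the code in `server.rs`: `PackageDataEntry` is reduced to three fields, and the elapsed time and stuck threshold are passed in as extra parameters (an assumption for this sketch) rather than read from a `transitional_since` side table.

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Installing,
    Stopping,
    Starting,
    Restarting,
    Updating,
    Removing,
}

impl PackageState {
    /// True for states owned by an in-flight lifecycle operation.
    pub fn is_transitional(self) -> bool {
        matches!(
            self,
            PackageState::Installing
                | PackageState::Stopping
                | PackageState::Starting
                | PackageState::Restarting
                | PackageState::Updating
                | PackageState::Removing
        )
    }
}

/// Reduced stand-in for the real entry (assumed fields).
#[derive(Clone, Debug, PartialEq)]
pub struct PackageDataEntry {
    pub state: PackageState,
    pub lan_address: Option<String>,
    pub health: Option<String>,
}

/// Take every observability field from `fresh`, but keep the existing
/// transitional state unless it has been held longer than `stuck_after`.
pub fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: &PackageDataEntry,
    in_state_for: Duration,
    stuck_after: Duration,
) -> PackageDataEntry {
    let mut merged = fresh.clone();
    if existing.state.is_transitional() && in_state_for < stuck_after {
        merged.state = existing.state;
    }
    merged
}
```

With a 330s stop timeout, the "2× the stop timeout" rule above would give `stuck_after = Duration::from_secs(660)`; the shipped code reportedly uses a flat 1200s.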
Commit 4 — fix(ui): single-button lifecycle control with transitional labels
- `neode-ui/src/api/container-client.ts`: extend `ContainerStatus.state` union to: `'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'`. Add `restartContainer(appId)` method calling `container-restart`.
- `neode-ui/src/stores/container.ts`: add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited` → `stopped`, `created` → `stopped`, `paused` → `stopped`, `installed` → `stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
- `neode-ui/src/views/ContainerApps.vue:85-136`: replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:

  | visual state | click action | label | spinner | disabled |
  |---|---|---|---|---|
  | `not-installed` | installApp | Install | no | no |
  | `running` | stopContainer | Stop | no | no |
  | `stopped` | startContainer | Start | no | no |
  | `starting` | none | Starting… | yes | yes |
  | `stopping` | none | Stopping… | yes | yes |
  | `restarting` | none | Restarting… | yes | yes |
  | `installing` | none | Installing… | yes | yes |
  | `updating` | none | Updating… | yes | yes |
  | `removing` | none | Removing… | yes | yes |

  - Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232): mirror the same single-button pattern.
- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312; make sure they use `getAppVisualState` where appropriate.
Verification gates (do not skip)
- `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
- `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH; at least the new merge helper test must pass
- Build release binary on .116: `nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown`. Poll until done.
- SCP binary to .228 `/usr/local/bin/archipelago`, back up prior to `/usr/local/bin/archipelago.bak-pre-async-stop`. `sudo systemctl restart archipelago` on .228.
- Manual LND stop test on .228:
  - Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'`, since LND is currently Exited(0) from the demo)
  - Click Stop
  - Expected: button immediately becomes "Stopping…" with spinner (RPC returns <1s)
  - Dashboard should stay on "Stopping…" for ~5 min
  - Then flip to "Start" button with label "Start"
  - At no point should it revert to "Running" mid-stop
- Same test with Bitcoin Core stop (longest timeout, 600s)
- Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is; check `/etc/nginx` on .228 first).
- Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
Key files (exact lines of interest)
- `core/archipelago/src/api/rpc/container.rs:85-107`: `handle_container_stop` (blocking, target of fix)
- `core/archipelago/src/api/rpc/container.rs:61-83`: `handle_container_start`
- `core/archipelago/src/api/rpc/container.rs:148-154`: narrow state mapping (drops transitional → "unknown")
- `core/archipelago/src/api/rpc/package/runtime.rs:11-24`: `stop_timeout_secs` table (reference, unchanged)
- `core/archipelago/src/api/rpc/package/runtime.rs:122-173`: `handle_package_stop` (also blocking, mirror treatment)
- `core/archipelago/src/api/rpc/package/runtime.rs:28-119`: `handle_package_start`
- `core/archipelago/src/api/rpc/package/runtime.rs:176-242`: `handle_package_restart`
- `core/archipelago/src/api/rpc/package/progress.rs`: existing broadcast pattern to mirror (`set_install_progress`, `set_uninstall_stage`)
- `core/archipelago/src/api/rpc/mod.rs:62-100`: `RpcHandler` struct (already holds `Arc<dyn ContainerOrchestrator>` + state_manager)
- `core/archipelago/src/server.rs:812-857`: `scan_and_update_packages` (merge loop at L850-857 is where transitional-state clobber happens)
- `core/archipelago/src/container/docker_packages.rs:636-663`: `convert_state` + `package_state_str` (read-only reference, no change)
- `core/archipelago/src/container/traits.rs`: `ContainerOrchestrator` trait (stays synchronous, do not change)
- `core/archipelago/src/crash_recovery.rs`: `mark_user_stopped` / `clear_user_stopped` (call order preserved)
- `core/archipelago/src/data_model.rs:107-124`: `PackageState` enum (no change, all variants exist)
- `neode-ui/src/api/container-client.ts`: `ContainerStatus` type + RPC methods (extend)
- `neode-ui/src/stores/container.ts:93-312`: Pinia store (add `getAppVisualState`, add `restartContainer` action)
- `neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383`: two-button block + state reads
- `neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232`: details page Stop/Start
Chaos harness (not in repo — lives on .116)
- `archipelago@192.168.1.116:~/ui-chaos/`: deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
- `/tmp/chaos/` on laptop: canonical source for rsync to .116.
- Run: `cd ~/ui-chaos && npx playwright test tests/<spec>`
- Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
- Uses SSH+Playwright hybrid per design; includes the `bash -lc '<escaped>'` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
Pre-existing bugs still deferred (do not fix until Stop UX lands)
- `archipelago --version` spawns server (should be a pure CLI query)
- RPC unknown-method returns generic error (should return method-not-found with the bad method name)
- `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`); some views need them visible
- `lnd.lan_address` stale on .228
- first-boot silent failure on some hardware
- `web-ui.failed.*` scar on .228 (benign systemd unit state)
- `test_parse_image_versions` pre-existing broken assertion; fix or `#[ignore]` when touching that area
Where we are
Working through the 11-step plan in rust-orchestrator-migration.md.
- Step 1: `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- Step 2: `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- Step 3: `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- Step 4: `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore `._*`)
- Step 5: `fc39b04b` BootReconciler with Arc shutdown, 4 paused-time tests pass
- Step 6: `48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
- Step 7: `069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
- Step 8a: `a0707f4d` retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
- Step 9: Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
- .228 dashboard bugs: ExtraHost `192.168.1.254` bug (`3ee192ba`) + LND macaroon permission bug (`be960023`). See "Post-Step 9 bug hunt" below.
- Step 8b: Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- Step 8c: Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- Step 10: Hot-swap + verify on .116 (adoption-heavy test; .116 already has all containers running)
- Step 11: Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)
Post-Step 9 bug hunt (.228, 2026-04-23)
User reported three visible dashboard bugs after Step 9 verification:
- LND — "no connect details or QR"
- ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
- bitcoin-core — in scope for chaos testing
Root cause #1 (ExtraHost, commit 3ee192ba): scripts/first-boot-containers.sh computed HOST_GATEWAY from ip route show default, which returns the LAN router (e.g. 192.168.1.254), not the gateway to the host. Every container configured with --add-host=host.containers.internal:$HOST_GATEWAY was dialing the WiFi router instead of the host. LND crash-looped with dial tcp 192.168.1.254:8332: connection refused; ElectrumX's DAEMON_URL hit the same dead end; any archy-net bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic host-gateway literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected --add-host; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
Root cause #2 (macaroon permissions, commit be960023): LND's admin.macaroon lives at /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (getinfo, connect-info, export-channel-backup) plus the shared lnd_client() helper failed with "Failed to read LND admin macaroon". Confirmed pre-existing on .116 too (long-standing bug unrelated to Step 9). Fix: centralised the path as LND_ADMIN_MACAROON_PATH, added a read_lnd_admin_macaroon() helper in api/rpc/lnd/mod.rs that tries direct read first then falls back to sudo -n cat (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — curl -k https://<host>/lnd-connect-info now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
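The "direct read first, then `sudo -n cat`" pattern from root cause #2 can be sketched roughly as follows. This is an illustration only, not the actual `read_lnd_admin_macaroon()` code: the function name `read_with_sudo_fallback` and its error wording are hypothetical, and it assumes a passwordless sudo rule for `cat` exists on the host.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Read a file directly; if that fails (for example with a permission
/// error), retry through `sudo -n cat`. Hypothetical helper for illustration.
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(direct_err) => {
            // `-n` makes sudo fail immediately instead of prompting for a password.
            let output = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!(
                        "direct read failed ({direct_err}); sudo fallback exited with {}",
                        output.status
                    ),
                ))
            }
        }
    }
}
```

Routing every call site through one helper like this keeps the fallback policy in a single place, which is the point of centralising the path constant as well.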
Step 9 evidence (.228, 2026-04-23)
- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- DEV_MODE override disabled (`override.conf` → `override.conf.disabled-pre-step9`).
- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]`: adoption of 13h-running container worked without recreation
  - `🔄 Boot reconciler started (interval: 30s)`: every 30s, all three app_ids reach `NoOp` after the initial install pass
  - `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18`: pre-start hook fires in `install_fresh`
  - `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
  - OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)
Bugs fixed this session
- `parse_memory_limit` truncation bug (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` → "128i" → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. 6 regression tests; `create_container` now omits instead of emitting 0.
- `archipelago.service` cgroup delegation missing (`ba83f9bc`): belt-and-braces `Delegate=memory pids cpu io`.
- ExtraHost `192.168.1.254` (`3ee192ba`): see Post-Step 9 bug hunt above.
- LND admin.macaroon unreadable (`be960023`): see Post-Step 9 bug hunt above.
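For reference, a suffix-aware parser that avoids the truncation failure mode might look like the sketch below. It is not the code from `732df1b8`: it reuses the name `parse_memory_limit` for readability, but the signature, the case-sensitive matching, the integer-only values, and the treatment of bare `K`/`M`/`G` as decimal units are all assumptions made for this example.

```rust
/// Parse a memory quantity such as "128Mi" into bytes.
/// Returns None for anything unrecognised so the caller can omit the
/// limit entirely rather than emit 0. Illustrative sketch only.
fn parse_memory_limit(input: &str) -> Option<u64> {
    let s = input.trim();
    // Two-letter IEC suffixes are listed first so "Mi" is matched as a
    // whole unit and never reduced to a dangling "i".
    const UNITS: [(&str, u64); 6] = [
        ("Ki", 1 << 10),
        ("Mi", 1 << 20),
        ("Gi", 1 << 30),
        ("K", 1_000),
        ("M", 1_000_000),
        ("G", 1_000_000_000),
    ];
    for (suffix, multiplier) in UNITS {
        if let Some(number) = s.strip_suffix(suffix) {
            let value: u64 = number.trim().parse().ok()?;
            // checked_mul turns an overflow into None instead of wrapping.
            return value.checked_mul(multiplier);
        }
    }
    // No recognised suffix: treat the input as a plain byte count.
    s.parse().ok()
}
```

With this shape, a leftover fragment like "128i" yields `None`, so the caller can decide to skip the limit rather than pass 0 downstream.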
Commits made this session
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer} (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
Branch is 19 commits ahead of tx1138/main (local only — user pushes to mirrors personally).
Uncommitted state
Clean. Only untracked: tests/ (bats harness from prior session, not in scope), tmp-dump-spec.py (scratch).
Answered design questions (no need to re-ask)
- UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
- BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
- Reconciler interval → 30 seconds
- Concurrency → per-app `Mutex<()>` in a `DashMap`
- Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
- Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
- Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
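The "atomic tmp+rename" part of the Step 7 answer can be illustrated with a small sketch. The names `write_atomic` and `render_template` and the `{{AUTH}}` placeholder are hypothetical, chosen for the example; the real template and hook may be structured differently.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Substitute a single placeholder in a template string.
/// The `{{AUTH}}` marker is an assumption made for this sketch.
fn render_template(template: &str, auth: &str) -> String {
    template.replace("{{AUTH}}", auth)
}

/// Write `contents` to `target` atomically: write a sibling temp file,
/// flush it to disk, then rename it over the target.
fn write_atomic(target: &Path, contents: &str) -> io::Result<()> {
    // Keep the temp file next to the target so the rename stays on one filesystem.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp_path)?;
    file.write_all(contents.as_bytes())?;
    file.sync_all()?;
    // On the same filesystem, rename replaces the target in one step, so a
    // reader sees either the old file or the new one, never a partial write.
    fs::rename(&tmp_path, target)
}
```

Because the render runs on install and on every reconcile, an atomic replace matters: a container that reads the bind-mounted file mid-render should never observe a half-written config.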
Context: which host is what
| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| archy | 192.168.1.116 | Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| archy228 | 192.168.1.228 | Kiosk HP ProDesk. Step 9 landing zone, now running Rust-orchestrator binary in prod mode. | password123 | archipelago |
Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.
Next action
Step 10 — Hot-swap on .116.
Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.
Steps:
- Disable DEV_MODE on .116 (check whether override.conf exists under `/etc/systemd/system/archipelago.service.d/`)
- Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago` → `/usr/local/bin/archipelago.new`
- Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
- Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
- Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
- `systemctl stop archipelago` → install binary → `systemctl start archipelago`
- Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
- If broken → restore `.bak` binary, re-enable DEV_MODE override.
- Commit STATUS.md update.
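One way to check the "no container was recreated" condition is to diff the pre-swap snapshot against a second snapshot taken afterwards. The sketch below assumes both snapshots are tab-separated `name`, `status`, `created_at` lines as produced by the snapshot command in the steps; `parse_snapshot` and `recreated_or_missing` are hypothetical helper names, not part of the repo.

```rust
use std::collections::HashMap;

/// Parse `name<TAB>status<TAB>created_at` lines into a name -> created_at map.
/// Lines with fewer than three fields are skipped.
fn parse_snapshot(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut fields = line.split('\t');
            let name = fields.next()?.trim();
            let _status = fields.next()?;
            let created = fields.next()?.trim();
            if name.is_empty() {
                return None;
            }
            Some((name.to_string(), created.to_string()))
        })
        .collect()
}

/// Names present before the swap that are now missing or whose creation
/// timestamp changed (which would indicate the container was recreated).
fn recreated_or_missing(before: &str, after: &str) -> Vec<String> {
    let before = parse_snapshot(before);
    let after = parse_snapshot(after);
    let mut changed: Vec<String> = before
        .iter()
        .filter(|(name, created)| after.get(*name) != Some(*created))
        .map(|(name, _)| name.clone())
        .collect();
    changed.sort();
    changed
}
```

An empty result means every pre-existing container kept its original creation timestamp; the status column is ignored on purpose, since uptime text changes between snapshots even for untouched containers.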
Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.
After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).
Why Step 8 got split (discovered 2026-04-23)
Original plan was one commit "delete bash + edit ISO builder". But on investigation:
- `first-boot-containers.sh` creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.
Archipelago — Current State, Plan, and Releases
Updated: 2026-04-22
This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.
Current state
Fleet status
All four Gitea mirrors are synced to v1.7.40-alpha:
| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via podman system renumber — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |
Fleet test nodes:
| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via systemd-run chmod 755 /opt/archipelago/web-ui after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | unreachable today | |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |
Known open issues (drives the plan below)
- UI companion containers disappear on .228 after daemon restarts — no auto-recreate (fixed by v1.7.45 Quadlet migration)
- bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
- `host.containers.internal` resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
- Podman state DB loss requires manual recovery (fixed by v1.7.44 startup self-heal)
- LND "Connect Wallet" info vanishing after crashes — symptom of the same drift class as #2
- ElectrumX not syncing on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled
Recent field incident (2026-04-22)
- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in `bulletproof-containers.md`.
Plan
We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.
Release roadmap
| Release | Closes | What lands | Status |
|---|---|---|---|
| v1.7.41 | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes https://127.0.0.1/ on boot; if non-200 within 90s, restores web-ui.bak + calls rollback_update() + restarts | in flight; deploying to .228 for test |
| v1.7.42 | FM4 (host.containers.internal wrong) | /etc/containers/containers.conf w/ host_containers_internal_ip = 10.89.0.1; every container gets --add-host=host.archipelago:10.89.0.1 | pending |
| v1.7.43 | FM2 (config drift) | reconcile::derived::render_bitcoin_conf: pure fn over canonical secret, rewrites on drift. Same for lnd.conf | pending |
| v1.7.44 | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via /run/user/$UID/* clear + system renumber | pending |
| v1.7.45 | FM1 + FM3 (companion orphans) | archy-bitcoin-ui → Quadlet .container unit in /etc/containers/systemd/. systemd (not archipelago) owns it | pending |
| v1.7.46 | - | archy-lnd-ui → Quadlet | pending |
| v1.7.47 | - | archy-electrs-ui → Quadlet | pending |
| v1.7.48+ | all (full daemon refactor) | core/archipelago/src/reconcile/ module replaces imperative install.rs container management. Main app containers become Quadlet too | pending |
Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
Release history
v1.7.41-alpha — IN FLIGHT — 2026-04-22
Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.
Changes:
- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot; probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
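The probe-then-rollback decision can be sketched as a deadline-bounded poll. This is a minimal illustration under stated assumptions, not the code in `update.rs`: `verify_update`, `Verdict`, and the injected `probe` closure are hypothetical names, and a real probe would issue the HTTPS request to `https://127.0.0.1/` and return the status code.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Outcome of the post-update health check (hypothetical type).
#[derive(Debug, PartialEq)]
enum Verdict {
    Healthy,
    RollBack,
}

/// Poll `probe` until it reports HTTP 200 or `deadline` has elapsed.
/// `probe` returns Some(status_code), or None when the request failed.
fn verify_update<F>(mut probe: F, deadline: Duration, interval: Duration) -> Verdict
where
    F: FnMut() -> Option<u16>,
{
    let start = Instant::now();
    loop {
        if probe() == Some(200) {
            return Verdict::Healthy;
        }
        // Give up only after a failed probe, so the last attempt is never skipped.
        if start.elapsed() >= deadline {
            return Verdict::RollBack;
        }
        sleep(interval);
    }
}
```

Injecting the probe as a closure keeps the timing logic testable without a live nginx; the caller would map `Verdict::RollBack` to the restore and restart sequence listed in the changes.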
v1.7.40-alpha — 2026-04-22
Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.
Changes:
- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
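The pre-ship assertion itself lives in a shell script; purely to illustrate the check it performs, the same test is shown here in Rust. `root_dir_mode_ok` is a hypothetical name, and the input is assumed to be the text of a verbose tar listing whose first line describes the archive's root directory.

```rust
/// Return true when the first line of a verbose tar listing shows the
/// root directory with mode 755 (`drwxr-xr-x`). Illustrative sketch only.
fn root_dir_mode_ok(tar_listing: &str) -> bool {
    tar_listing
        .lines()
        .next()
        .map_or(false, |first| first.starts_with("drwxr-xr-x"))
}
```

An empty listing is treated as a failure, so a tarball that could not be listed at all also aborts the release.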
v1.7.39-alpha — 2026-04-22
Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.
v1.7.38-alpha — 2026-04-22
Onboarding auto-heal + silent logins + App Store trim.
Changes:
- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state; backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()`; silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`
(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)
v1.7.37-alpha — 2026-04-22
Bitcoin Core install fixes + dynamic node UI + full-archive default.
- Bitcoin Core passes explicit `-rpcbind/-rpcallowip/etc.` CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode; full archive default
Key docs
- `bulletproof-containers.md`: full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- `BETA-RELEASE-CHECKLIST.md`: existing beta checklist
- `BETA-ISSUES-20260328.md`: prior beta-blocker tracking
- `hotfix-process.md`: release workflow
- `architecture.md`: system architecture overview
How to resume
- Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
- Read `bulletproof-containers.md` for the current plan
- Check task list (`/list` or via Claude Code) for the in-flight release
- Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified