archy/docs/RESUME.md
2026-06-11 00:24:54 -04:00

RESUME - Archipelago Release Hardening on .198

Last updated: 2026-06-10

2026-06-10 05:48 EDT Active Session Checkpoint

Work resumed from docs/NEXT_TERMINAL_HANDOFF.md. No .198 host actions have been run yet in this resumed pass.

Current first steps:

  1. Rerun git diff --check.
  2. Rerun the focused Rust image-version test for the Nextcloud false-update helper.
  3. If those are clean, inspect and continue the rootless Podman lifecycle/scanner-backoff work before any .198 validation.

Progress:

  • git diff --check passed.
  • Focused Rust image-version test in /tmp/archy-cargo-image-versions remains inconclusive: the tool PTY stayed open after compile output stopped, with no active cargo, rustc, or linker process visible.
  • Bounded retry of the focused image-version test using the normal workspace target also timed out: timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests exited 124 after compiling the archipelago test target without reaching test output. Nextcloud false-update validation is still not closed.
  • Local code change in progress: single-orchestrator package.stop now returns immediately with a stopping status and runs the orchestrator stop in the background, instead of blocking the RPC/UI while Podman cleanup happens.
  • cargo fmt --manifest-path core/Cargo.toml --all --check passed.
  • Compile check passed in /tmp/archy-cargo-runtime-check: cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago.
  • git diff --check passed after the stop-path edit and doc updates.
  • Lower-level stop path inspection: Quadlet service stop is already bounded with kill/reset recovery, and the runtime fallback treats already-absent containers as success. No extra lower-level stop change was made.
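
The package.stop change noted in the progress list above can be pictured with a small sketch. It is written in Python with asyncio purely for illustration; the real backend is Rust, and `package_stop`, `orchestrator_stop`, and the `state` table are hypothetical names, not the project's API.

```python
import asyncio

# Hypothetical in-memory status table; the real backend keeps its own state.
state: dict[str, str] = {}
_background: set[asyncio.Task] = set()

async def orchestrator_stop(app: str) -> None:
    """Stand-in for the slow orchestrator/Podman stop path."""
    await asyncio.sleep(0.05)  # simulate cleanup latency
    state[app] = "stopped"

async def package_stop(app: str) -> dict:
    """Reply with 'stopping' right away; finish the stop in the background."""
    state[app] = "stopping"
    task = asyncio.create_task(orchestrator_stop(app))
    _background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(_background.discard)
    return {"app": app, "status": "stopping"}

async def demo() -> tuple[dict, str, str]:
    reply = await package_stop("vaultwarden")
    seen_immediately = state["vaultwarden"]
    await asyncio.gather(*_background)  # wait for the background stop to finish
    return reply, seen_immediately, state["vaultwarden"]
```

The caller sees `stopping` before any cleanup has run, and the final state only changes once the background task completes. A real implementation would also need to record a failed background stop so the UI does not stay on `stopping` forever.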

2026-06-10 05:30 EDT Pause Checkpoint

User paused to switch machines. Continue from /home/archipelago/Projects/archy and read docs/NEXT_TERMINAL_HANDOFF.md plus docs/1.8-alpha-improvements-tracker.md first. No dev server or validation command should be intentionally left running from this checkpoint.

Latest local-only tracker progress:

  • Done: uninstall preserve/delete-data choice, companion APK QR/download modal, App Details setup-instructions card, dead/coming-soon UI cleanup via Spotlight AI placeholder removal.
  • In progress: Fleet/tab loading polish, Bitcoin receive-address readiness states, no-registration credentials inventory, Nextcloud false-update fix.
  • New credential fallback: PhotoPrism now shows manifest-backed credentials (admin / archipelago) when backend credentials are empty. Grafana was not added because GRAFANA_ADMIN_PASSWORD is not resolved to a known repo default/secret.
  • Nextcloud local fix: manifest/catalog/UI metadata now points at nextcloud:29 and image update detection ignores registry-host-only changes. Catalog drift passed, but backend focused Rust validation did not complete cleanly. First cargo test -p archipelago container::image_versions::tests from core/ hit a Rust linker/incremental artifact failure while /tmp was full; a non-incremental retry was killed after running too long. Old /tmp/archy-cargo-* build-cache directories were removed and /tmp recovered.
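
As a rough illustration of "ignores registry-host-only changes" in the Nextcloud item above, the comparison could look like the sketch below. This is an assumption about the approach, not the Rust helper itself: `split_image_ref` and `is_real_update` are hypothetical names, and the registry hosts in the usage example are placeholders.

```python
def split_image_ref(ref: str) -> tuple[str, str]:
    """Split an image reference into (registry_host, remainder).

    The first path component counts as a registry host only if it contains
    '.' or ':' or equals 'localhost' (the usual Docker/OCI heuristic).
    """
    head, sep, tail = ref.partition("/")
    if sep and ("." in head or ":" in head or head == "localhost"):
        return head, tail
    return "", ref

def is_real_update(current: str, candidate: str) -> bool:
    """True only when the repository path or tag differs, ignoring the host."""
    return split_image_ref(current)[1] != split_image_ref(candidate)[1]
```

For example, `is_real_update("registry-a.example/library/nextcloud:29", "registry-b.example/library/nextcloud:29")` is false, while `is_real_update("nextcloud:29", "nextcloud:30")` is true. The sketch does not normalize implicit `library/` prefixes or a missing `latest` tag.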

Latest local validations:

  • npm run type-check passed after the PhotoPrism credential fallback.
  • npm test -- --run src/views/apps/__tests__/appCredentials.test.ts passed.
  • git diff --check passed after the Spotlight cleanup and should be rerun after resuming.
  • python3 scripts/check-app-catalog-drift.py --release --strict passed during the Nextcloud pass.

Immediate next steps:

  1. Rerun git diff --check.
  2. Rerun cargo test -p archipelago container::image_versions::tests from core/ when ready to validate the Nextcloud update-detection helper.
  3. Continue the docs/1.8-alpha-improvements-tracker.md rows that remain todo or in-progress, avoiding host-gated items until .198 access is intentionally resumed.
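
Because earlier runs of the focused test hung or were killed, it may help to wrap the rerun in a wall-clock limit and report a timeout as inconclusive instead of failed. The sketch below is a hypothetical helper (`run_bounded` is not part of the repo).

```python
import subprocess
import sys

def run_bounded(cmd: list[str], timeout_s: float) -> str:
    """Run cmd with a wall-clock limit and classify the outcome."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # Only the direct child is killed here; grandchildren may survive.
        return "inconclusive"  # out of time: says nothing about the tests
    return "passed" if result.returncode == 0 else "failed"
```

Usage would be along the lines of `run_bounded(["cargo", "test", "-p", "archipelago", "container::image_versions::tests"], 300)` from core/. Output is discarded here to keep the sketch short; a real wrapper would keep the log.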

2026-06-09 Resume Handoff - Read First

Last user prompt to preserve:

please can we save all our progress, backlog, and goal to memory so I can resume on another device please

including the last prompt

Ultimate release goal:

Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs.

Important target node:

  • Validation node: archipelago@192.168.1.198, password password123.
  • Current release deadline pressure from user: production release target was Thursday, 2026-06-11.
  • Tests have been run mostly on .198; user noted we may also need to validate on the current intended release server, not only .198.
  • Avoid broad/destructive Podman store cleanup. Do not use git reset --hard or revert unrelated user changes.

Current deployed backend on .198:

  • Latest deployed /usr/local/bin/archipelago sha256: 9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f.
  • A later local-only code change exists and passed cargo check: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff.
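
The difference between TCP reachability and HTTP reachability in the local-only patch above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the Rust code: `http_status` and `web_app_healthy` are hypothetical names, and "healthy" is taken to mean a 2xx or 3xx response.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow redirects; surface the 3xx status itself

def http_status(url: str, timeout_s: float = 3.0) -> Optional[int]:
    """Return the HTTP status code, or None if no HTTP response arrived."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx/4xx/5xx still prove that HTTP is being spoken
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # refused, timed out, or the port is open but not HTTP

def web_app_healthy(status: Optional[int]) -> bool:
    """An open TCP port is not enough; require HTTP success or redirect."""
    return status is not None and 200 <= status < 400
```

With this shape, an app whose port accepts connections during boot but returns nothing, or returns a 5xx, is not reported healthy.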

Major progress achieved in the latest session:

  • Beta Telemetry / Fleet collector:
    • Confirmed TELEMETRY_COLLECTOR_URL was not set in the current shell and no repo/service config was setting it.
    • Fixed the periodic reporter to POST a telemetry.ingest JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body.
    • Added optional systemd env loading with EnvironmentFile=-/var/lib/archipelago/telemetry.env in image-recipe/configs/archipelago.service.
    • Updated scripts/deploy-to-target.sh so deployments write /var/lib/archipelago/telemetry.env when TELEMETRY_COLLECTOR_URL is exported in scripts/deploy-config.sh.
    • Documented the expected value shape in scripts/deploy-config.example: https://<collector-host>/rpc/v1.
    • Verification passed: cargo fmt -p archipelago --manifest-path core/Cargo.toml, bash -n scripts/deploy-to-target.sh, git diff --check for the touched files, and CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml.
    • systemd-analyze verify image-recipe/configs/archipelago.service could not run in the sandbox because systemd bus access failed with SO_PASSCRED failed: Operation not permitted.
    • Still needed: choose the real collector host, create or update local scripts/deploy-config.sh with export TELEMETRY_COLLECTOR_URL='https://<collector-host>/rpc/v1', deploy, restart archipelago, and confirm opted-in nodes ingest into Fleet.
  • IndeeHub:
    • Recovered stale/corrupt metadata/container state enough to run a fresh lifecycle.
    • Full lifecycle passed earlier on .198.
    • Verified launch on 7778.
    • Verified /nostr-provider.js is served and the Nostr signer bridge requirement is preserved.
  • Saleor:
    • Removed from app catalog/server as requested.
  • Bitcoin Knots / Bitcoin UI:
    • Fixed false health path so bitcoin-knots health no longer just probes the UI bridge on 8334.
    • Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure.
    • Verified /bitcoin-status recovered; node is in IBD and pruned, progress around 6-7% during latest checks.
  • Fedimint:
    • Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway.
    • Fixed Guardian startup path so fedimint uses manifest-backed Quadlet/orchestrator, not legacy startup.
    • Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts.
    • Fedimint Guardian unit now includes FM_BITCOIND_URL=http://bitcoin-knots:8332.
    • Added manifest wrapper that waits for Bitcoin RPC sync with "initialblockdownload":false before launching fedimintd.
    • Current correct behavior on .198: fedimint.service active and logging Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...; RPC health returns starting; container-list now reports fedimint as starting instead of stale stopping.
    • Guardian iframe/tab does not yet show UI because fedimintd is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe.
  • BotFights:
    • User reported stopped/unhealthy.
    • Added botfights to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery.
    • Deployed backend hash 9a00e543....
    • BotFights started and is active.
    • Direct checks after it finished booting: / returned HTTP 200; /api/health returned {"status":"ok","name":"botfights"}.
    • Note: .198 manifests still use git.tx1138.com/lfg2025/botfights:1.1.0; local repo manifest shows 146.59.87.168:3000/lfg2025/botfights:1.1.0. Reconcile this catalog/manifest mismatch later.
  • Status/health correctness:
    • Reduced container health/status Podman timeouts to avoid UI hanging forever.
    • container-list now refreshes stale cached states and uses Quadlet service-active fallback for stale stopping states.
    • Fedimint stale stopping fixed to starting.
    • Local-only patch passed cargo check: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights.
  • Filebrowser/Home Assistant/Immich/Bitcoin:
    • Latest RPC health check showed filebrowser healthy, homeassistant healthy, immich healthy, bitcoin-knots healthy.
    • Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation.
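
For the telemetry reporter fix in the list above, the change from a raw report body to a JSON-RPC envelope might look like the sketch below. The JSON-RPC 2.0 framing, the use of `params` to carry the report unchanged, and the sample report fields are assumptions; only the method name telemetry.ingest and the /rpc/v1 URL shape come from these notes.

```python
import itertools
import json
import urllib.request

_ids = itertools.count(1)  # simple request id source (assumption)

def telemetry_envelope(report: dict) -> bytes:
    """Wrap a telemetry report in a JSON-RPC style envelope."""
    envelope = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "telemetry.ingest",
        "params": report,
    }
    return json.dumps(envelope).encode("utf-8")

def build_request(collector_url: str, report: dict) -> urllib.request.Request:
    """Build (but do not send) the POST to the configured collector."""
    return urllib.request.Request(
        collector_url,
        data=telemetry_envelope(report),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would then be `urllib.request.urlopen(build_request(url, report), timeout=10)`, with `url` taken from TELEMETRY_COLLECTOR_URL.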

Current critical blockers:

  • Runtime control plane / Podman scanning:
    • Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow podman ps.
    • Logs show repeated podman ps -a --format json timed out after 30s and crash recovery podman ps stopped timed out after 60s.
    • This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions.
    • Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff.
  • My Apps UI false negatives:
    • User reports apps sometimes do not show, "checking" forever, "loading apps" sometimes good but often false "no apps installed".
    • Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout.
  • Fedimint Guardian:
    • Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD.
    • Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on 8175 once Bitcoin sync condition is satisfied.
  • Progress UX:
    • User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen.
    • The uninstall indicator currently shows poor or no progress. Must fix with clear phase updates and no stale notifications.
  • Stale health notifications:
    • Stale notifications must not keep triggering on new logins/refreshes once they are no longer valid.
    • Some UI filtering was patched earlier, but keep this in regression backlog.
  • Reboot survival:
    • Must pass repeated reboot validation after runtime/status fixes.
    • Acceptance target from user: minimum 3 clean consecutive reboots, preferably 5.
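
The required My Apps behavior during scanner backoff (keep last-known apps, never conclude "no apps installed" from a timeout) can be sketched as a small cache. The class and state names below are hypothetical and only illustrate the rule; they are not the existing backend or UI types.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AppListCache:
    apps: list[str] = field(default_factory=list)
    has_scanned: bool = False
    stale: bool = False

    def apply_scan(self, result: Optional[list[str]]) -> None:
        """result is None when the scan timed out or the scanner is in backoff."""
        if result is None:
            self.stale = True  # keep last-known apps, only mark them stale
            return
        self.apps, self.has_scanned, self.stale = list(result), True, False

    def view(self) -> dict:
        if not self.has_scanned:
            ui = "checking"  # no successful scan yet: draw no conclusion
        elif self.stale:
            ui = "stale"     # show last-known apps with a stale marker
        elif not self.apps:
            ui = "empty"     # only a successful scan may say "no apps installed"
        else:
            ui = "ready"
        return {"state": ui, "apps": list(self.apps)}
```

The key property is that `empty` is reachable only after a scan that actually succeeded and returned nothing.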

Backlog captured from user reports:

  • Portainer:
    • Environment wizard error: Dial unix /var/run/docker.sock: connect: connection refused.
    • User noted Portainer does Podman orchestration well; compare/learn from its socket/control flow where useful.
  • Fedimint:
    • Setup after guardian confirmation caused app not to launch.
    • Guardian launch was opening Gateway before; do not regress. Guardian and Gateway must remain distinct.
    • Gateway app disappeared from catalog before; it has been restored but keep in regression tests.
  • Bitcoin Knots:
    • User saw missing app/launch issues and status bridge messages. UI now improved, but include in lifecycle/reboot regression.
  • Home Assistant:
    • Setup has issues on this node and restart hung for a long time.
  • Immich:
    • After setup user saw HTTP 500 stacktrace from loadServerConfig. Needs focused post-setup validation, not just "healthy".
  • Filebrowser:
    • User saw erroneous stopped status while app was working. Status ordering was patched; keep in regression.
  • Tailscale:
    • Launch must show local login/auth UI, not merely container running.
  • BTCPay/Fedimint/Gateway/other Bitcoin-dependent apps:
    • Need clearer dependency wait states when Bitcoin RPC is slow/IBD.
  • App catalog/developer readiness:
    • Apps should not require OS-level changes per app.
    • App migration document and developer guide must include this principle and current app packaging contract.
  • Saleor:
    • Removed from catalog/server and should stay removed unless intentionally reintroduced.
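
For the dependency wait states mentioned in the backlog above, one way to turn a Bitcoin RPC result into user-facing status is sketched below. It assumes the input is the result object of Bitcoin Core's `getblockchaininfo` call (which includes `initialblockdownload` and `verificationprogress`); the function name, state names, and message copy are hypothetical.

```python
from typing import Optional

def bitcoin_dependency_state(info: Optional[dict]) -> tuple[str, str]:
    """Map a getblockchaininfo result (or None if RPC failed) to (state, message)."""
    if info is None:
        return "waiting", "Waiting for Bitcoin RPC to respond"
    if info.get("initialblockdownload", True):
        pct = float(info.get("verificationprogress", 0.0)) * 100
        return "waiting", f"Waiting for Bitcoin sync ({pct:.1f}% verified)"
    return "ready", "Bitcoin is synced"
```

A dependent app such as Fedimint Guardian could show the message instead of opening a blank iframe while the state is `waiting`.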

Release readiness estimate:

  • Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%.
  • Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation.

Suggested immediate next steps after resuming:

  1. Read this file and verify no background build/process is running.
  2. Build/deploy the local-only HTTP-health tightening patch if not already deployed.
  3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by podman ps.
  4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking.
  5. Run focused status checks on .198: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer.
  6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts.

Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim.


Resume Prompt

Continue Archipelago release hardening from docs/RESUME.md. First read docs/RESUME.md, docs/CONTAINER_LIFECYCLE_HANDOFF.md, and docs/MIGRATION_STATUS_REPORT.md. The active validation node is .198 at 192.168.1.198; keep archipelago-doctor.timer and archipelago-reconcile.timer inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on .198; the store is known to hang under load. Preserve app data. Latest deployed backend hash is f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3. This includes the rootless Podman socket fix that treats /run/user/1000/podman/podman.sock as a socket bind, never a directory/data bind, prefers persistent podman-archy-api.service for Portainer, and changes absent cached Stopping entries to Stopped. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer //deleted, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. IndeeHub remains P0: Podman still has a corrupted indeedhub|Removing|97cf9fd13bb2 record; targeted podman rm -f, podman rm -f --time 0, and podman container cleanup --rm indeedhub hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. IndeeHub is not passing unless http://<node>:7778/ is reachable and /nostr-provider.js is injected/served so the Nostr signer works as before. 
Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching .198, summarize current state and your exact first step.


Current Goal

Cut Archipelago 1.8-alpha, including a ready-to-test ISO image.

Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live .198 testing still shows the app platform is not production-bulletproof. Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final .198 lifecycle confidence, and cutting/smoke-testing the 1.8-alpha ISO.

Release Readiness Estimate

  • Estimated completion: 68%.
  • What is already achieved:
    • manifest-driven app migration is substantially advanced;
    • catalog metadata generation and strict drift checks are green;
    • local backend/frontend release gates have been green in prior passes;
    • broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding;
    • Podman store-risk paths have been quarantined from known fragile broad image/store commands;
    • IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness;
    • targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness;
    • mobile and desktop app progress UX now has clearer install/remove phase labels in local changes;
    • Vaultwarden full preserve-data lifecycle passed on .198 after the rootless socket fix;
    • Portainer full preserve-data lifecycle passed on .198 after recreating the container against persistent podman-archy-api.service; its mount now points at /podman/podman.sock, not /podman/podman.sock//deleted.
  • What must still pass before release:
    • deploy the current Immich readiness-gating backend and frontend progress UX changes;
    • focused Immich validation: install must stay in progress until http://<node>:2283/ returns HTTP success and app launch opens the frontend;
    • focused IndeeHub validation: recover stale/corrupt frontend container, prove http://<node>:7778/, and prove /nostr-provider.js signer bridge is injected/served;
    • keep Vaultwarden in regression coverage even though the latest full lifecycle passed;
    • focused Tailscale validation: launch must present the local login/auth link/UI on 8240;
    • focused Portainer validation: user must retry the environment wizard and confirm it can connect to the rootless Podman socket at /var/run/docker.sock;
    • full preserve-data lifecycle testing for representative migrated apps and key stacks: install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch;
    • progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough;
    • app packaging documentation gate: update docs/APP-PACKAGING-MIGRATION-PLAN.md and docs/app-developer-guide.md so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks;
    • required refactor/remove-dead-code gate: after correctness is proven and before cutting 1.8-alpha, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward;
    • broad non-destructive lifecycle after the deploy;
    • at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each;
    • preferably 5 consecutive clean reboot iterations before calling 1.8-alpha production-release ready;
    • final local release gates after any additional fixes;
    • cut the 1.8-alpha ISO;
    • boot/smoke-test the ISO enough to prove installability, backend startup, UI startup, app catalog availability, and at least a focused app lifecycle.

Latest User Directive

A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria

please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks

also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't

Also BTCPay is not running either

no my bad, wrong server, BTCPay is fine just slow, please continue

Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

please confirm there is a refactor/remove dead code release gate too

Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration. IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability.
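
The adopted criterion can be written down as a small check. This is a sketch with hypothetical names and state strings: an iteration is clean only if every managed app is reachable after boot and the broad lifecycle was green, and the gate counts the current run of clean iterations.

```python
def reboot_iteration_clean(apps: dict[str, str], lifecycle_green: bool) -> bool:
    """apps maps app id -> post-boot state.

    SIGKILL during shutdown is deliberately not an input: only the
    post-boot outcome decides whether the iteration passes.
    """
    return lifecycle_green and all(s == "reachable" for s in apps.values())

def consecutive_clean(results: list[bool]) -> int:
    """Length of the run of clean iterations ending at the latest reboot."""
    streak = 0
    for ok in reversed(results):
        if not ok:
            break
        streak += 1
    return streak

def reboot_gate(results: list[bool], minimum: int = 3, preferred: int = 5) -> str:
    n = consecutive_clean(results)
    if n >= preferred:
        return "preferred"
    if n >= minimum:
        return "minimum"
    return "not-met"
```

Any failed iteration resets the streak, so `[True, False, True, True, True]` meets the minimum but an earlier pass followed by a recent failure does not.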

Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container running/healthy is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart. Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm.

There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut 1.8-alpha without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make 1.8-alpha hard to maintain, then rerun the release gates.


Live .198 State

  • Host: 192.168.1.198.
  • Password for lifecycle harness/RPC login: password123.
  • Latest recorded /usr/local/bin/archipelago sha256: f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3.
  • archipelago.service: active.
  • archipelago-doctor.timer: inactive.
  • archipelago-reconcile.timer: inactive.
  • /: 65% used, about 9.6G free.
  • /var/lib/archipelago: about 9-10% used, about 370G free.

Current active app blockers:

  • Immich: after deploying hash 54d781..., reinstall no longer immediately stops. Live test showed immich_postgres and immich_redis healthy and immich_server running; first launch had a readiness gap while Immich ran migrations/geodata import, then 2283 returned HTTP 200. Local follow-up changes add an Immich server health check and require healthy status before install completes.
  • IndeeHub: still blocked. Latest targeted check after hash f1f5c61c... showed a corrupted Podman ghost record: indeedhub|Removing|97cf9fd13bb2; podman inspect indeedhub fails with layer not known. Targeted podman rm -f, podman rm -f --time 0, and podman container cleanup --rm indeedhub hang and must be killed. Must recover this record without broad store cleanup and then verify http://<node>:7778/ plus /nostr-provider.js for the Nostr signer.
  • Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker.
  • Uptime Kuma: user reports install takes ages then app stops. Live logs showed package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1.
  • Nextcloud: user reports same install-then-stop behavior. Live logs showed package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1.
  • Vaultwarden: latest full preserve-data lifecycle passed on hash 2a168489...: install -> launch on 8082 -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping.
  • Portainer: latest full preserve-data lifecycle passed on hash 2a168489.... The stale mount was confirmed as /run/user/1000/podman/podman.sock//deleted; after persistent podman-archy-api.service and Portainer recreate, mountinfo shows /podman/podman.sock without //deleted and http://127.0.0.1:9000/ returns HTTP 200. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
  • Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on 8240.
  • Fedimint: user reported it showed stopping; after hash f1f5c61c..., direct targeted state shows fedimint|Up ... (healthy) and RPC container-list shows fedimint running. Keep in focused regression/launch checks.
  • BotFights: newly reported stopped/broken. Direct probe after the report showed botfights running/healthy and http://127.0.0.1:9100/ returning 200; keep in focused lifecycle/launch validation after Podman control-plane recovery.
  • Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where /run/user/1000/podman/podman.sock could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent podman-archy-api.service. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer.
  • Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted podman rm fallbacks are not sufficient once a record is wedged in Removing.
  • Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen.
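
Records wedged like the IndeeHub one above can at least be detected mechanically from a targeted listing. The sketch below assumes the listing uses the same `name|state|id` shape as the record quoted in these notes; the function name and the sample lines in the usage note are illustrative only.

```python
def wedged_records(listing: str, stuck_states: tuple[str, ...] = ("Removing",)) -> list[tuple[str, str]]:
    """Return (name, container_id) for records stuck in a transitional state."""
    found = []
    for line in listing.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, state, container_id = parts
        if state in stuck_states:
            found.append((name, container_id))
    return found
```

Given a listing that contains `indeedhub|Removing|97cf9fd13bb2`, the function returns `[("indeedhub", "97cf9fd13bb2")]`. Detection is the easy part; recovery still has to avoid broad store cleanup.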

Do not treat root disk pressure as a current blocker anymore. It was reduced from 99% used with under 600M free to about 65% used with roughly 10G free.

2026-06-10 Resume Continuation Checkpoint

  • Deployed backend hash 7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83 to .198.
    • Previous live hash observed before deploy: 9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f.
    • archipelago.service is active.
    • archipelago-doctor.timer and archipelago-reconcile.timer are inactive.
  • Added explicit release gates to this handoff:
    • app packaging docs must be updated before 1.8-alpha;
    • refactor/remove-dead-code is required before 1.8-alpha, after correctness validation and before final release gates/ISO.
  • Local validation before deploy:
    • bash -n tests/lifecycle/remote-lifecycle.sh passed;
    • cargo fmt --manifest-path core/Cargo.toml --all;
    • cargo test --manifest-path core/Cargo.toml -p archipelago-container passed (45 tests);
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container passed;
    • python3 scripts/check-app-catalog-drift.py --release --strict passed;
    • git diff --check passed.
    • Filtered cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed.
  • IndeeHub live validation after deploy:
    • container-list reports indeedhub running;
    • container-health reports {"indeedhub":"healthy"};
    • http://192.168.1.198:7778/ returns HTTP 200;
    • http://192.168.1.198:7778/nostr-provider.js returns HTTP 200 and contains the Archipelago NIP-07/NIP-98 Nostr provider shim.
  • Immich live validation after deploy:
    • container-list reports immich running;
    • direct http://192.168.1.198:2283/ returns HTTP 200;
    • container-health reported {"immich":"unknown"} during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable.
  • Tailscale live validation after deploy:
    • Found the live generated unit still used the stale catalog command sleep 2; tailscale web...; locally patched app-catalog/catalog.json, neode-ui/public/catalog.json, and scripts/first-boot-containers.sh to use the safer socket-wait startup, and copied the catalog to /opt/archipelago/web-ui/catalog.json.
    • App-scoped package.restart tailscale failed via RPC with podman ps timed out while listing containers.
    • Patched the live generated Tailscale .container unit to match the catalog fix and restarted only tailscale.service; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes.
    • After restart, the Tailscale unit runs both tailscaled and tailscale web, container-list reports tailscale running, container-health reports {"tailscale":"healthy"}, and http://192.168.1.198:8240/ returns HTTP 200 with Tailscale UI content.
    • Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker.
  • Other live probes after deploy:
    • portainer HTTP 9000 returns 200; user still needs to retry the environment wizard.
    • vaultwarden HTTP 8082 returns 200 from localhost on .198.
    • botfights HTTP 9100 returns 200 from localhost on .198.
    • btcpay-server returned 302 then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails.
    • fedimint port 8175 reset during probe while RPC showed starting; keep expected Bitcoin-sync wait-state/status copy in scope.
  • Podman/control-plane remains the active systemic blocker:
    • logs still show podman ps timed out, podman stats timed out, scan backoff, and slow app cleanup;
    • do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts.
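
The idea behind the "socket-wait startup" that replaced the fixed `sleep 2` in the Tailscale item above is to poll for the daemon socket with a deadline. The actual catalog command is a shell snippet not reproduced in these notes; the Python sketch below only illustrates the technique, and `wait_for_socket` is a hypothetical name.

```python
import os
import stat
import time

def wait_for_socket(path: str, timeout_s: float, poll_s: float = 0.1) -> bool:
    """Poll until path exists and is a Unix socket, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True  # a directory or regular file at this path does not count
        except OSError:
            pass  # not there yet (or not accessible); keep polling
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The type check also covers the earlier failure mode where a socket path was created as a directory. It confirms only that the socket exists, not that the daemon behind it is accepting connections.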

Latest Completed Work

2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix

  • Built and deployed backend hash 2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536 to .198; then built and deployed follow-up hash f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3; archipelago.service active, archipelago-doctor.timer inactive, archipelago-reconcile.timer inactive.
  • Fixed rootless Podman socket bind handling in core/archipelago/src/container/prod_orchestrator.rs:
    • /run/user/1000/podman/podman.sock is skipped by bind-directory creation and data UID/chown prep;
    • socket bind mounts call explicit socket repair before other bind prep;
    • ensure_user_podman_socket() now prefers persistent podman-archy-api.service at unix:///run/user/1000/podman/podman.sock, falling back to podman.socket only if needed.
  • Validated locally before deploy:
    • cargo fmt --manifest-path core/Cargo.toml --all.
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_ (4 passed, including the stale absent Stopping regression tests).
    • git diff --check.
    • timeout 900s cargo build --manifest-path core/Cargo.toml -p archipelago --release.
  • Vaultwarden full preserve-data lifecycle passed on .198:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Portainer full preserve-data lifecycle passed on .198:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Portainer stale socket mount was confirmed and repaired:
    • Before recreate, mountinfo showed /run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock.
    • After persistent podman-archy-api.service and Portainer recreate, mountinfo shows /podman/podman.sock -> /var/run/docker.sock, host socket exists, and Portainer UI returns HTTP 200.
    • User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect.
  • Direct state check after deploy:
    • fedimint|Up ... (healthy) and RPC container-list shows fedimint running.
    • indeedhub|Removing|97cf9fd13bb2; podman inspect fails with layer not known; targeted removal/cleanup hangs and had to be killed.
    • vaultwarden running true.
    • portainer running true.
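
The stale Portainer socket mount above was identified by the `//deleted` suffix in mountinfo. A minimal Python sketch of that check follows, assuming the standard `/proc/<pid>/mountinfo` field order (root is the 4th field, mount point the 5th). The function name and the sample lines are illustrative and not taken from the repo.

```python
def stale_bind_mounts(mountinfo_text):
    """Return (root, mount_point) pairs whose bind source was deleted."""
    stale = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        root, mount_point = fields[3], fields[4]
        # The kernel appends "//deleted" when the bind source was unlinked.
        if root.endswith("//deleted"):
            stale.append((root, mount_point))
    return stale


# Illustrative lines only; mount IDs, devices, and options are made up.
SAMPLE = """\
812 790 0:25 /podman/podman.sock//deleted /var/run/docker.sock rw,nosuid,nodev - tmpfs tmpfs rw
813 790 0:25 /podman/podman.sock /run/other.sock rw,nosuid,nodev - tmpfs tmpfs rw
"""
print(stale_bind_mounts(SAMPLE))
```

Run against a container's own mountinfo, a non-empty result suggests the container must be recreated to pick up the new socket inode.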

2026-06-08 Reboot Blocker Follow-up In Progress

  • User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot.
  • Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean.
  • Local changes made in this pass:
    • hardened core/archipelago/src/container/prod_orchestrator.rs IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port 7778;
    • hardened core/container/src/manifest.rs package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests;
    • updated tests/lifecycle/remote-lifecycle.sh so IndeeHub launch validation requires /nostr-provider.js to be injected into the HTML and served from the app, preserving the Nostr signer requirement.
  • Deployed follow-up backend hash 4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba to .198; service active, timers inactive. Focused audit still showed:
    • indeedhub stuck stopping and unhealthy;
    • immich stopped/unhealthy;
    • tailscale running/healthy but direct launch 8240 returned 000;
    • vaultwarden health RPC errored and launch 8082 returned 000;
    • btcpay-server was fine (23000 returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm.
  • Targeted diagnostics on .198 found:
    • IndeeHub frontend Podman state removing/stopping with no 7778 listener;
    • Immich server stopped, Redis exited, Postgres unhealthy, no 2283 listener;
    • Tailscale listener process existed on 8240, but direct HTTP still returned 000; logs show Tailscale is NeedsLogin/WantRunning=false, so launch must present the login/auth UI rather than a generic daemon endpoint;
    • Vaultwarden container was absent; public package.start vaultwarden failed on stale/refused Podman socket before local fixes;
    • Portainer launches but the environment wizard reports Cannot connect to the Docker daemon at unix:///var/run/docker.sock, confirming socket-backed apps are not release-ready.
  • Local follow-up fixes after those diagnostics:
    • core/container/src/runtime.rs now tries podman rm -f --time 0, targeted podman container cleanup, and another rm -f when normal forced remove fails;
    • ensure_user_podman_socket() now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists;
    • IndeeHub readiness now falls back to platform-managed network-alias presence when getent inside the API image cannot prove DNS;
    • lifecycle harness now requires Tailscale launch content to look like login/auth UI.
  • Local validation passed after those fixes:
    • cargo fmt --manifest-path core/Cargo.toml --all.
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • cargo test --manifest-path core/Cargo.toml -p archipelago-container (45 passed).
    • bash -n tests/lifecycle/remote-lifecycle.sh.
    • git diff --check.
  • Deployed second follow-up backend hash 06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439 to .198; service active, timers inactive.
  • Public RPC recovery attempts on hash 06420c...:
    • package.restart indeedhub still failed;
    • package.start immich accepted async start but app remained starting with no 2283 launch;
    • package.start vaultwarden accepted async start but no 8082 launch appeared;
    • package.restart portainer failed;
    • package.restart tailscale accepted async restart but no 8240 launch UI appeared.
  • Latest focused probe after hash 06420c...:
    • tailscale running, http://192.168.1.198:8240/ returns 000;
    • immich starting, http://192.168.1.198:2283/ returns 000;
    • indeedhub stopping, http://192.168.1.198:7778/ returns 000;
    • portainer running, http://192.168.1.198:9000/ returns 000;
    • vaultwarden absent/not listed, http://192.168.1.198:8082/ returns 000.
  • Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
  • Local validation passed so far:
    • cargo fmt --manifest-path core/Cargo.toml --all.
    • cargo test --manifest-path core/Cargo.toml -p archipelago-container (45 passed).
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • bash -n tests/lifecycle/remote-lifecycle.sh.
    • git diff --check.
  • A filtered cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub compiled and ran the matching existing IndeedHub test (1 passed); it did not exercise the new reboot recovery branch because there is no direct unit for that path yet.
  • Next steps:
    • deploy the new backend only after approval;
    • verify focused indeedhub,immich,tailscale,vaultwarden,portainer lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability;
    • run reboot validation iterations on .198 only after explicit approval;
    • pass threshold: 3 consecutive clean post-fix reboots minimum, 5 preferred for production-release confidence;
    • cut and smoke-test the 1.8-alpha ISO after reboot validation is green.
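
The socket fix above hinges on the difference between "the socket path exists" and "the socket accepts connections". A minimal Python sketch of the stricter check follows. It is not the Rust `ensure_user_podman_socket()` code, and the helper name is hypothetical.

```python
import socket


def unix_socket_accepts(path, timeout=2.0):
    """True only if something is listening on the Unix socket at `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing path, a stale socket file (connection refused),
        # permission errors, and timeouts.
        return False
    finally:
        sock.close()


print(unix_socket_accepts("/run/user/1000/podman/podman.sock"))
```

A leftover socket file with no listener passes an existence test but fails this one, which matches the stale/refused-socket failure seen on `package.start vaultwarden`.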

Local Release Gate Completion After .198 App Recovery

  • Did not touch .198, reboot the host, change timers, or run Podman store-wide commands.
  • Fixed scanner backoff/in-flight skip behavior: skipped scans now bump scan_tick, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active.
  • Fixed stale crash-recovery unit tests after should_auto_start_stopped_container gained the include_stack_members flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them.
  • Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace apps/*/manifest.yml via CARGO_MANIFEST_DIR; this covers new public apps such as PhotoPrism.
  • Fixed journal usage parsing for real journalctl --disk-usage compact output such as 463.9M.
  • Fixed boot-reconciler cadence tests so without_companion_stage() also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion.
  • Verified catalog generation is idempotent: python3 scripts/generate-app-catalog.py reported updated 0 fields for both catalogs.
  • Validation passed locally:
    • cargo fmt --manifest-path core/Cargo.toml --all.
    • cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago (688 passed).
    • cargo test --manifest-path core/Cargo.toml -p archipelago-container (43 passed).
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security.
    • cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security (12 security tests passed; performance has no tests).
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • python3 scripts/check-app-catalog-drift.py --release --strict.
    • python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py.
    • git diff --check.
    • cmp -s app-catalog/catalog.json neode-ui/public/catalog.json.
  • The remaining gated item is host reboot validation on .198, only if explicitly approved.
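
For the journal usage fix, a minimal Python sketch of parsing compact sizes such as 463.9M follows. It assumes the usual `journalctl --disk-usage` sentence wording and 1024-based single-letter units. The function name is hypothetical and the Rust parser may differ.

```python
import re

_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_journal_usage(text):
    """Return the first size such as '463.9M' in `text` as bytes, else None."""
    match = re.search(r"(\d+(?:\.\d+)?)([BKMGT])\b", text)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


# Assumed typical wording of the command's output.
line = "Archived and active journals take up 463.9M in the file system."
print(parse_journal_usage(line))
```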

Frontend Release Gate Completion

  • Did not touch .198, reboot the host, change timers, or run Podman store-wide commands.
  • Found and fixed a mobile app-launch regression in neode-ui/src/stores/appLauncher.ts:
    • desktop-only new-tab apps still open directly on desktop;
    • mobile now routes those apps through the app-session route instead of escaping Archipelago in a new browser tab;
    • dashboardReturnPath() now tolerates tests/minimal router mocks with no currentRoute.
  • Updated frontend tests to match current desktop new-tab policy and mobile in-app routing behavior.
  • Fixed AppIconGrid test setup so it shares the mounted Pinia instance and mocks credential lookup before launch.
  • Fixed onboarding retry test timing to cover the actual exponential retry budget.
  • Validation passed locally:
    • npm run type-check from neode-ui.
    • npm test from neode-ui (548 passed).
    • npm run build from neode-ui.
    • python3 scripts/generate-app-catalog.py (updated 0 fields).
    • python3 scripts/check-app-catalog-drift.py --release --strict.
    • python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py.
    • cmp -s app-catalog/catalog.json neode-ui/public/catalog.json.
    • git diff --check.
  • Local caveat: npm ci is currently blocked because existing neode-ui/node_modules/@alloc entries are owned by root:root. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.

Fedimint/File Browser, Nostr/NPM, and IndeedHub Recovery

  • Built and deployed backend hash 95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de to .198.
  • Fixed UI-facing package health for reachable running apps whose Podman health stayed starting, unhealthy, or a numeric exit value while the launch port was reachable.
  • Confirmed Fedimint Guardian and File Browser were actually reachable; their server.get-state package-data now reports healthy instead of “starting up”.
  • Fixed Nostr relay port conflict by moving apps/nostr-rs-relay/manifest.yml host port from 8081 to 18081.
  • Recovered Nginx Proxy Manager admin launch on 8081; Nostr now launches on 18081 and no longer captures the NPM launch port.
  • Hardened legacy package install so scoped web-app installs use podman create plus systemd-run --user --scope podman start, avoiding backend-cgroup coupling without hanging the install RPC.
  • Recovered IndeedHub without deleting data: started the stopped indeedhub-minio dependency, repaired frontend reachability, and verified 7778 returns the app.
  • Validation passed:
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • python3 scripts/check-app-catalog-drift.py --release --strict.
    • Focused lifecycle for indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser.
    • Direct launch checks returned HTTP 200 for 7778, 8081, 18081, 8175, and 8083.
    • Broad non-destructive lifecycle passed on live hash 95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de.
  • Final .198 state after validation: archipelago.service active; archipelago-doctor.timer inactive; archipelago-reconcile.timer inactive; / at 65% used with about 9.6G free; /var/lib/archipelago at 10% used with about 370G free.
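
The Nostr/NPM collision above is the kind of problem a host-port conflict check over manifests can catch before deploy. A minimal Python sketch follows. The input shape (app ID to list of host ports) is an assumption, and only the ports named in these notes are shown.

```python
from collections import defaultdict


def host_port_conflicts(app_ports):
    """Map each host port claimed by more than one app to the sorted app IDs."""
    owners = defaultdict(set)
    for app_id, ports in app_ports.items():
        for port in ports:
            owners[port].add(app_id)
    return {port: sorted(apps) for port, apps in owners.items() if len(apps) > 1}


before = {"nginx-proxy-manager": [8081], "nostr-rs-relay": [8081]}
after = {"nginx-proxy-manager": [8081], "nostr-rs-relay": [18081]}
print(host_port_conflicts(before))  # {8081: ['nginx-proxy-manager', 'nostr-rs-relay']}
print(host_port_conflicts(after))   # {}
```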

Deployed Podman Store-Risk Cleanup

  • Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on .198.
  • Bounded stack installer image pulls and manual package update image pulls with kill_on_drop and 600s timeouts.
  • Deployed backend hash a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4 to .198 with the previous backend backed up under /usr/local/bin/archipelago.backup-20260608-store-risk-*.
  • Validation passed:
    • python3 scripts/check-app-catalog-drift.py --release --strict.
    • cargo fmt from core/.
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • Focused post-deploy lifecycle: ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
    • Broad post-deploy non-destructive lifecycle: ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
  • Final .198 state after validation: archipelago.service active; archipelago-doctor.timer inactive; archipelago-reconcile.timer inactive; / at 65% used with about 9.8G free; /var/lib/archipelago at 10% used with about 370G free.
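
The bounded-pull change pairs a timeout with killing the child so a hung pull cannot outlive its caller. A rough Python analogue of that pattern follows. The backend does this in Rust with `kill_on_drop` and a 600s timeout. `run_bounded` is a hypothetical name, and the commands below are stand-ins, not real Podman calls.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run `cmd`; return ('ok' | 'failed' | 'timed_out', stdout)."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising.
        return "timed_out", ""
    return ("ok" if done.returncode == 0 else "failed"), done.stdout


# Stand-in for a hung pull: a child that sleeps longer than the bound.
print(run_bounded([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

The caller sees a distinct `timed_out` outcome, so it can fail fast or retry instead of blocking the RPC.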

Release Candidate Backend Restart Validation

  • Built and deployed backend hash e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732 to .198.
  • Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under .198 Podman store/socket load.
  • Fixed Fedimint health reporting: if Podman health remains starting but the app endpoint is reachable, container-health can use the reachable cached app fallback.
  • Fixed package start/restart fallback for runtime web apps by using systemd-run --user --scope for podman start, then falling back to direct bounded podman start.
  • Recovered live Immich without data loss:
    • immich_server had exited because /usr/src/app/upload/encoded-video/.immich could not be written.
    • The correct live ownership fix is still podman unshare chown -R 0:0 /var/lib/archipelago/immich: inside the rootless user namespace 0:0 is container root, which maps to host UID/GID 1000:1000.
    • A temporary 1000:1000 in-container ownership experiment was reverted because Immich's storage check writes as container root.
  • Validation passed on latest hash:
    • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • python3 scripts/check-app-catalog-drift.py --release --strict.
    • npm run build from neode-ui.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
    • Backend restart validation followed by focused fedimint,immich,indeedhub,photoprism lifecycle passed.
    • Post-restart broad non-destructive lifecycle passed.
  • Remaining gate before calling this a release: host reboot validation, if approved.
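
The Immich ownership note follows from the default rootless ID mapping. A minimal Python sketch follows, assuming container ID 0 maps to the host user and IDs 1 and up map into a subordinate range. The `100000`/`65536` range is a common `/etc/subuid` default and an assumption here, not a value read from .198.

```python
def host_id(container_id, user_id=1000, sub_start=100000, sub_count=65536):
    """Map a container UID/GID to the host ID under the default rootless mapping."""
    if container_id == 0:
        return user_id  # container root is the unprivileged host user
    if 1 <= container_id <= sub_count:
        return sub_start + container_id - 1
    raise ValueError(f"container ID {container_id} is outside the mapped range")


print(host_id(0))     # 1000
print(host_id(1000))  # 100999
```

Under these assumptions, chown 0:0 inside `podman unshare` leaves files owned by 1000:1000 on the host, while an in-container 1000:1000 lands on a subordinate ID instead.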

IndeedHub and Immich Lifecycle Recovery

  • Built and deployed backend hash 89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9 to .198.
  • IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running.
  • Fedimint and NetBird focused audits are green; they were not current blockers after rerun.
  • Immich was the broad-audit blocker and is now green:
    • dependency readiness accepts healthy Podman health state for immich_postgres and immich_redis before falling back to slower exec probes;
    • immich_server startup repairs /var/lib/archipelago/immich ownership with podman unshare chown -R 0:0, preserving upload data while matching the current rootless container user mapping;
    • this fixed the observed EACCES on /usr/src/app/upload/encoded-video/.immich.
  • Validation passed on latest hash:
    • cargo check --manifest-path core/Cargo.toml -p archipelago.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
  • Residual risk remains: .198 still intermittently logs podman ps -a --format json timed out after 30s and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands.

Release Refactor Cleanup

  • Built and deployed backend hash 14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b to .198.
  • Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available.
  • Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility.
  • Removed the duplicate Gitea-specific stale port cleanup helper.
  • Validation passed on latest hash:
    • cargo check --manifest-path core/Cargo.toml -p archipelago.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
  • Added focused runtime-host-port tests, but local cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports did not finish within 5 minutes during compilation.
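
The manifest-first port lookup described above can be sketched as follows. This is Python, not the Rust implementation, and the manifest shape (`ports` entries with a `host` key) and the function name are hypothetical.

```python
def runtime_host_ports(app_id, manifests, legacy_ports):
    """Prefer host ports declared in the app manifest; fall back to a legacy table."""
    manifest = manifests.get(app_id)
    if manifest:
        ports = [p["host"] for p in manifest.get("ports", []) if "host" in p]
        if ports:
            return ports
    # Legacy/non-manifest apps keep using the hardcoded table.
    return list(legacy_ports.get(app_id, []))


manifests = {"photoprism": {"ports": [{"host": 2342, "container": 2342}]}}
legacy = {"jellyfin": [8096], "photoprism": [9999]}
print(runtime_host_ports("photoprism", manifests, legacy))  # [2342]
print(runtime_host_ports("jellyfin", manifests, legacy))    # [8096]
```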

Catalog Metadata Generation

  • Added scripts/generate-app-catalog.py to sync manifest-owned metadata into app-catalog/catalog.json and neode-ui/public/catalog.json.
  • The generator updates fields that manifests already own: title, version, description, dockerImage, category, tier, icon, and repoUrl.
  • The catalog still preserves catalog-only fields such as author, requires, featured, and rich containerConfig notes.
  • Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation.
  • Release catalog drift is now zero:
    • python3 scripts/check-app-catalog-drift.py --release --strict reports metadata_drift=0, missing_catalog=0, missing_manifests=0.
  • Validation passed:
    • jq empty app-catalog/catalog.json neode-ui/public/catalog.json.
    • canonical and UI public catalogs match byte-for-byte.
    • cargo test --manifest-path core/Cargo.toml -p archipelago-container.
    • cargo check --manifest-path core/Cargo.toml -p archipelago.
    • npm run build from neode-ui.
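
The generator's contract (sync manifest-owned fields, preserve catalog-only fields, report how many fields changed) can be sketched as below. This is a simplified illustration, not the contents of scripts/generate-app-catalog.py, and the sample entry values are made up.

```python
MANIFEST_OWNED = ("title", "version", "description", "dockerImage",
                  "category", "tier", "icon", "repoUrl")


def sync_entry(catalog_entry, manifest):
    """Copy manifest-owned fields into the catalog entry; return how many changed."""
    updated = 0
    for field in MANIFEST_OWNED:
        if field in manifest and catalog_entry.get(field) != manifest[field]:
            catalog_entry[field] = manifest[field]
            updated += 1
    return updated


entry = {"title": "Old Title", "version": "0.1.0", "author": "someone", "featured": True}
manifest = {"title": "New Title", "version": "0.2.0"}
print(sync_entry(entry, manifest))  # 2
print(sync_entry(entry, manifest))  # 0
```

A second run reporting 0 updated fields is the same idempotency signal as the "updated 0 fields" result recorded elsewhere in these notes.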

Podman Store-Risk Hardening

  • Built and deployed backend hash eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2 to .198.
  • Fresh local-build installs now treat podman image exists <local-build-tag> failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation.
  • This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior.
  • Validation passed on the latest hash:
    • cargo check --manifest-path core/Cargo.toml -p archipelago.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
  • Added focused unit test coverage for the image-exists failure behavior, but local cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails did not complete within 15 minutes during compilation.

Container Health Fallback and Broad Lifecycle Green

  • Built and deployed backend hash be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36 to .198.
  • Fixed container-health broad lifecycle timeout behavior:
    • cached_reachable_health() now parses ports from URLs with trailing slashes correctly, such as http://localhost:2342/.
    • The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others.
    • Cached-running apps with reachable local TCP listeners can report healthy without depending on flaky Podman health/inspect calls.
  • Validation passed on the latest hash:
    • cargo check --manifest-path core/Cargo.toml -p archipelago.
    • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh.
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh.
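
The two pieces of the health fallback above (port parsing that tolerates trailing slashes, then a bounded local TCP probe) can be sketched in Python as follows. The function names are hypothetical, and the Rust `cached_reachable_health()` logic is not reproduced here.

```python
import socket
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def port_from_url(url):
    """Return the TCP port for `url`, with or without a trailing slash or path."""
    parts = urlsplit(url)
    return parts.port or _DEFAULT_PORTS.get(parts.scheme)


def tcp_reachable(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(port_from_url("http://localhost:2342/"))  # 2342
print(port_from_url("http://localhost:2342"))   # 2342
```

Using a URL parser rather than string splitting means a trailing slash or path cannot leak into the port value.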

Generic Host-Port Health Checkpoint

  • Built and deployed backend hash 3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b to .198.
  • Confirmed the objective remains: app behavior should be owned by manifests and platform primitives, not by OS-image changes or per-app backend hacks.
  • Broad lifecycle on d21202cd... failed only on Uptime Kuma briefly showing stopping during listener repair; it recovered afterward.
  • Fixed the stale transitional-state merge: a stale Stopping entry now recovers to Running when no user-stop marker exists; user-initiated stops still keep Stopping.
  • Health monitor now derives required host TCP ports from Podman JSON Ports and marks running containers unhealthy when declared host listeners are missing.
  • This is generic host-port health, not an app-specific mapping.
  • After deploying 3912b900..., Uptime Kuma recovered 3002 and returned HTTP 302 after backend restart.
  • Jellyfin still needs follow-up: Podman reports jellyfin Up ... (healthy) with 0.0.0.0:8096->8096/tcp, but ss shows no 8096 listener and curl http://192.168.1.198:8096/ fails.
  • Follow-up on be95ea... resolved the broad lifecycle timeout by hardening container-health fallback behavior.
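
The generic host-port health rule above can be sketched as: collect the host TCP ports a container publishes, then flag any that have no listener. The Python below assumes the `Ports` entry shape (`host_port`, `range`, `protocol`) as I recall it from `podman ps --format json`, and takes the set of listening ports as an input rather than reading it from `ss`. Function names are hypothetical.

```python
def required_host_tcp_ports(container):
    """Collect published host TCP ports from one `podman ps --format json` entry."""
    required = set()
    for mapping in container.get("Ports") or []:
        if mapping.get("protocol", "tcp") != "tcp":
            continue
        first = mapping["host_port"]
        # `range` is the number of consecutive ports covered by this mapping.
        required.update(range(first, first + mapping.get("range", 1)))
    return required


def missing_listeners(container, listening_ports):
    """Published host TCP ports that have no listener on the host."""
    return sorted(required_host_tcp_ports(container) - set(listening_ports))


jellyfin = {"Names": ["jellyfin"], "Ports": [
    {"host_ip": "0.0.0.0", "container_port": 8096, "host_port": 8096,
     "range": 1, "protocol": "tcp"}]}
print(missing_listeners(jellyfin, {22, 3002}))  # [8096]
```

This reproduces the Jellyfin case: Podman reports the container healthy, but the published 8096 listener is absent, so the container would be marked unhealthy.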

Stale State and Jellyfin Pasta Listener Hardening

  • Built and deployed backend hash d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e to .198.
  • container-list now overlays cached exited entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale exited after recovery.
  • container-health now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads.
  • Jellyfin was added to legacy runtime host-port repair for pasta listener 8096.
  • package.restart jellyfin still exposed a real Podman socket/runtime blocker after stopping the container: Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied.
  • package.start jellyfin recovered the app afterward; jellyfin became Up ... (healthy), 8096 had a pasta.avx2 listener, and http://192.168.1.198:8096/ returned HTTP 302.
  • Focused lifecycle passed on the latest hash:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Release catalog drift check remains: missing_catalog=0, missing_manifests=0, metadata_drift=35.
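
The container-list overlay above only spends targeted lookups on entries cached as exited. A minimal Python sketch of that idea follows. `inspect_one` stands in for a bounded per-container state probe and is hypothetical, as is the function name.

```python
def overlay_live_state(cached, inspect_one):
    """Refresh only cached 'exited' entries using a targeted per-container lookup."""
    merged = dict(cached)
    for name, state in cached.items():
        if state != "exited":
            continue
        live = inspect_one(name)  # returns a state string, or None on failure/timeout
        if live is not None:
            merged[name] = live
    return merged


cached = {"jellyfin": "exited", "filebrowser": "running", "meshtastic": "exited"}
live = {"jellyfin": "running"}  # the meshtastic probe "times out" (no entry)
print(overlay_live_state(cached, live.get))
```

A failed probe keeps the cached value, so scanner backoff degrades to stale data rather than to an error.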

Expanded Cleanup and Store-Safe Uninstall

  • Built and deployed backend hash 7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b to .198.
  • Expanded system.disk-cleanup to remove old rollback artifacts while keeping newest rollback points:
    • /usr/local/bin/archipelago.backup-*: keep the newest 3.
    • legacy /usr/local/bin/archipelago.bak*: keep the newest 3.
    • /usr/local/bin/archipelago.before-*: keep the newest 3, as part of legacy backend cleanup.
    • /opt/archipelago/web-ui.bak*: keep the newest 3.
    • /opt/archipelago/web-ui.old included as web UI rollback cleanup.
  • Live system.disk-cleanup reclaimed 10.3 GB:
    • Removed old backend backups: 41.6 MB freed.
    • Removed old legacy backend backups: 3.6 GB freed.
    • Removed old web UI backups: 6.6 GB freed.
    • Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes.
  • /usr/local/bin dropped to about 336M.
  • /opt/archipelago dropped to about 1.1G.
  • Removed global podman volume prune -f from uninstall. Uninstall now logs a skip and still removes explicit app data when preserve_data=false.
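
The keep-newest retention rule used by the cleanup above can be sketched in Python as below. The sketch only selects removal candidates by modification time and deletes nothing. The helper name is hypothetical, and the Rust `system.disk-cleanup` code may order artifacts differently.

```python
import glob
import os


def stale_backups(pattern, keep=3):
    """Return paths matching `pattern` except the `keep` most recently modified."""
    paths = sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
    return paths[keep:]


print(stale_backups("/usr/local/bin/archipelago.backup-*"))
```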

Startup Scan and Uptime Kuma Fixes

  • Startup adopt_existing() is bounded with a 35s timeout.
  • Initial container scan seeds the same 300s Podman scan backoff used by periodic scans.
  • Legacy pasta restart paths use scoped podman restart instead of stop+start.
  • Uptime Kuma was repaired:
    • Before: container internally healthy on 127.0.0.1:3001, but host 3002 had no pasta listener.
    • After: package.restart uptime-kuma returns {"status":"restarted"} and http://192.168.1.198:3002/ returns HTTP 302.
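
The scan backoff and the scan_tick behavior noted elsewhere in these notes interact as follows: a scan requested during backoff is skipped, but the tick still advances so callers waiting on it are released. A minimal Python sketch with an injected clock follows. The class and method names are hypothetical, and when exactly the real scanner seeds its backoff is an assumption.

```python
class ScanScheduler:
    """Tracks Podman scan backoff; skipped scans still advance scan_tick."""

    def __init__(self, backoff_s=300.0):
        self.backoff_s = backoff_s
        self.backoff_until = 0.0
        self.scan_tick = 0

    def seed_backoff(self, now):
        """Start (or restart) the backoff window, e.g. after a timed-out scan."""
        self.backoff_until = now + self.backoff_s

    def request_scan(self, now, scan):
        """Run `scan` unless backing off; bump scan_tick either way."""
        ran = now >= self.backoff_until
        if ran and not scan():
            self.seed_backoff(now)  # scan failed or timed out
        self.scan_tick += 1  # waiters see progress even when the scan was skipped
        return ran


sched = ScanScheduler()
sched.seed_backoff(now=0.0)
print(sched.request_scan(10.0, lambda: True), sched.scan_tick)  # False 1
```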

Cleanup and Catalog Work Already Done

  • system.disk-cleanup intentionally skips Podman image/volume prune.
  • nostr-rs-relay was added to both catalog surfaces.
  • scripts/check-app-catalog-drift.py --release --strict reports zero missing catalog/manifest entries and zero metadata drift after catalog generation.
  • Meshtastic app.files live behavior was validated: deleting /var/lib/archipelago/meshtastic/config.yaml and restarting recreated it from the manifest.

Verification Already Run

  • cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container passed for the currently deployed release-candidate line.
  • cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release passed for the currently deployed release-candidate line.
  • Broad lifecycle on hash 14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b passed:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Targeted PhotoPrism audit on current hash passed:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh
  • Focused lifecycle on hash d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e passed:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Live cleanup RPC passed and reclaimed 10.3 GB.
  • Focused lifecycle after expanded cleanup passed:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Before the expanded cleanup pass, broad lifecycle also passed on hash 2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28:
    • ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh
  • Direct app checks after latest cleanup passed:
    • http://192.168.1.198:3002/ -> HTTP 302.
    • http://192.168.1.198:8096/ -> HTTP 302 after Jellyfin recovery/start.
    • http://192.168.1.198:8083/ -> HTTP 404 on /, which is expected for Filebrowser root probe behavior used here.

Test Caveat

  • Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago (688 passed).
  • Remaining workspace packages also pass checks/tests: archipelago-container, archipelago-performance, and archipelago-security.

Critical Constraints

  • Preserve app data.
  • .198 is the active validation node.
  • Current live backend hash on .198: 7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412.
  • Keep archipelago-doctor.timer and archipelago-reconcile.timer inactive unless explicitly testing them.
  • Do not run destructive git commands.
  • Do not run Podman store-wide cleanup or broad image/store commands on .198 without a mitigation plan:
    • Avoid podman system df.
    • Avoid podman image list / podman image ls.
    • Avoid broad podman image exists loops.
    • Avoid podman image prune and podman volume prune.
  • Podman store commands can hang and block app health under current .198 load.
  • Latest local mitigation: Rust release image-existence probes now use bounded targeted podman image inspect instead of podman image exists or podman images -q.
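
The store-wide command constraint above could be enforced in a wrapper before anything runs on .198. A minimal Python sketch of such a guard follows. The deny-list mirrors the commands named in this section plus `podman images`. The function name is hypothetical, and the sketch ignores global options that take a value (for example `--root /x`).

```python
STORE_WIDE = {
    ("system", "df"),
    ("image", "list"), ("image", "ls"), ("images",),
    ("image", "prune"), ("volume", "prune"),
}


def is_store_wide(argv):
    """True if `argv` is a podman invocation the constraints say to avoid."""
    if not argv or argv[0] != "podman":
        return False
    # Drop flags so `podman images -q` and `podman image prune -f` still match.
    words = [a for a in argv[1:] if not a.startswith("-")]
    return tuple(words[:2]) in STORE_WIDE or tuple(words[:1]) in STORE_WIDE


print(is_store_wide(["podman", "volume", "prune", "-f"]))             # True
print(is_store_wide(["podman", "image", "inspect", "example:tag"]))   # False
```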

Current Remaining Blockers

  1. Podman socket/store health remains unresolved.

    • Need quarantine/mitigation strategy rather than store-wide commands in release paths.
    • Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded podman image inspect.
    • The latest concrete failure is historical and has not been reproduced since: package.restart jellyfin stopped the container but failed to complete because Podman reported a socket permission/runtime failure. package.start jellyfin recovered afterward.
    • Latest deployed hash still logged one initial podman ps -a --format json scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed.
  2. Release code-review/refactor gate is still open.

    • Reduce remaining app-specific Rust/OS branches where possible.
    • Review scanner, health, reconcile, and install/update paths for performance and store-risk.
    • Clean up dead transitional paths.
  3. Clean release branch hygiene is not done.

    • Worktree is very dirty with many modified and untracked files.
    • Do not commit unless explicitly asked.
  4. Full production validation is still needed.

    • Broad non-destructive lifecycle is green on live hash 7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412.
    • Backend restart validation has passed.
    • Run host reboot validation if approved.
    • Run selected full lifecycle tests for critical apps if time allows.
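
Because the validation results above are tied to a specific live backend hash, it can help to confirm the deployed binary before running further validation. A minimal sketch, assuming coreutils sha256sum is available on the node; binary_matches_hash is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the file's SHA-256 equals the expected digest.
binary_matches_hash() {
  # $1: file to hash, $2: expected lowercase hex digest
  # sha256sum prints "<digest>  <path>"; keep only the digest field.
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}
```

On .198 this could be called as binary_matches_hash /usr/local/bin/archipelago followed by the hash quoted in this document; a non-zero status means the deployed binary is not the validated one (or could not be read).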

Files Changed In Latest Pass

  • core/container/src/runtime.rs

    • Changed Podman runtime image_exists() from podman image exists to a bounded targeted podman image inspect local-storage probe.
  • core/archipelago/src/api/rpc/package/install.rs

    • Replaced legacy podman images -q local fallback and post-pull verification checks with bounded targeted podman image inspect.
  • core/archipelago/src/container/companion.rs

    • Changed companion image existence checks from podman image exists to podman image inspect.
  • core/archipelago/src/container/prod_orchestrator.rs

    • Updated image-existence failure test fixture wording for the new image inspect probe.
  • Validation for latest local mitigation:

    • cargo fmt --all --check passed.
    • cargo check -p archipelago-container passed.
    • cargo check -p archipelago passed.
    • CARGO_INCREMENTAL=0 cargo check -p archipelago --tests passed.
    • cargo test -p archipelago-container passed (43 tests).
    • git diff --check -- <changed files> passed.
    • The filtered cargo test -p archipelago install_fresh_build run did not complete: one run hit a rust-lld undefined hidden symbol artifact/link failure after concurrent Cargo jobs, and the sequential CARGO_INCREMENTAL=0 rerun exceeded 10 minutes during compilation. Test-target compilation passed afterward.
  • core/archipelago/src/api/rpc/system/handlers.rs

    • Calls expanded rollback cleanup helpers and reports reclaimed bytes.
  • core/archipelago/src/api/rpc/system/mod.rs

    • Added cleanup helpers for legacy backend backups and web UI rollback backups.
    • Uses size accounting for directories before removal.
    • Keeps newest rollback artifacts instead of deleting all.
  • core/archipelago/src/api/rpc/package/runtime.rs

    • Skips global podman volume prune -f during uninstall.
    • Adds Jellyfin 8096 to runtime host-port/pasta cleanup repair.
    • Derives legacy runtime host-port cleanup/repair ports from manifests.
    • Keeps compatibility fallback ports for legacy/non-manifest apps and removes duplicate Gitea stale-port cleanup code.
  • core/archipelago/src/api/rpc/container.rs

    • Adds a refresh of stale cached exited state for container-list.
    • Adds a container-health fallback that combines cached running state with a local TCP reachability check.
    • Fixes fallback URL port parsing and expands lifecycle web app port coverage.
  • core/archipelago/src/container/prod_orchestrator.rs

    • Rebuilds local-build images when image_exists fails/times out instead of failing fresh install.
    • Adds focused unit test coverage for that behavior.
  • scripts/generate-app-catalog.py

    • Generates/syncs public catalog metadata from manifest-owned fields.
  • app-catalog/catalog.json and neode-ui/public/catalog.json

    • Generated from current manifests; files match byte-for-byte.
  • docs/CONTAINER_LIFECYCLE_HANDOFF.md

    • Added latest deployment, cleanup, validation, and residual-risk checkpoint.
  • docs/MIGRATION_STATUS_REPORT.md

    • Updated current hash, root disk state, and remaining blockers.
  • docs/RESUME.md

    • This file, replacing stale April migration resume content.
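
The byte-for-byte match noted above for the two generated catalog files can be rechecked with cmp. A small sketch; catalogs_in_sync is a hypothetical name, and the default paths are the two catalog files listed above, relative to the repository root.

```shell
# Hypothetical helper: succeed only if both catalog copies are byte-identical.
catalogs_in_sync() {
  # Optional $1 and $2 override the two paths named in this document.
  # cmp -s is silent: exit 0 means identical, 1 means different, >1 means an error.
  cmp -s "${1:-app-catalog/catalog.json}" "${2:-neode-ui/public/catalog.json}"
}
```

Running catalogs_in_sync from the repository root after regenerating the catalog gives a quick signal that the public copy was not left stale.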

Suggested Next Steps

  1. Re-read the three docs:

    • docs/RESUME.md
    • docs/CONTAINER_LIFECYCLE_HANDOFF.md
    • docs/MIGRATION_STATUS_REPORT.md
  2. Verify latest .198 state:

    • ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'df -h / /var/lib/archipelago; systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; sha256sum /usr/local/bin/archipelago'
  3. Start Podman-store-risk review:

    • Search for image/store operations: image_exists, podman image, podman system, podman prune, volume prune.
    • Prefer targeted container status/API calls with timeouts.
    • Avoid new broad store commands.
  4. Continue release code-review/refactor cleanup.

  5. If approved, rerun backend-restart validation (it has already passed; see blocker 4) and then run host-reboot validation.
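
The store-risk search in step 3 could start from a single recursive grep over the source tree. A rough sketch, assuming a grep that supports -r (GNU or BSD); find_store_calls is a hypothetical name, and the pattern is simply the search terms listed in step 3. Calls that pass "image" and "exists" as separate argument strings will not match the multi-word patterns, so the output is a starting point rather than a complete inventory.

```shell
# Hypothetical helper: list lines that mention the store operations named in step 3.
find_store_calls() {
  # $1: directory to search, for example core
  # -r recurses, -n prints line numbers, -E enables the alternation pattern.
  grep -rnE 'image_exists|podman image|podman system|podman prune|volume prune' "$1"
}
```

This only reads local source files, so it is safe to run without touching the Podman store on .198.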


Current Release Readiness Estimate

  • Credible release candidate: closer now, roughly 87-91%.
  • Production-quality release developers will love: still closer to 73-79%.

The biggest improvement in the latest pass is that broad lifecycle validation is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health.