feat(orchestrator): complete container migration and release hardening
238
docs/RESUME.md
@@ -1,190 +1,126 @@
|
||||
# RESUME — Install UX polish round (v1.7.43-alpha)
|
||||
# RESUME — Rust orchestrator migration, Step 8b
|
||||
|
||||
Last updated: 2026-04-23
|
||||
Last updated: 2026-04-23 (evening, post-architecture-audit)
|
||||
|
||||
Read this first if you're a fresh OpenCode session resuming the install/uninstall/update UX work.
|
||||
Read this first if you're a fresh OpenCode session resuming work. Paste the "Resume prompt" below verbatim.
|
||||
|
||||
---
|
||||
|
||||
## Where we are right now
|
||||
## Resume prompt (paste this into a new opencode session)
|
||||
|
||||
**v1.7.43-alpha shipped and deployed to .228**. Latest addition: image-versions.sh path bug fixed (silent update-check failure on all production nodes). User is about to walk the marketplace app-by-app on `.228` to shake out any remaining broken apps. Tracker for that walk: `docs/MARKETPLACE-QA.md`.
|
||||
|
||||
Commits on `.116:main` (newest first, unpushed per user mirror protocol):
|
||||
- `a9908597` fix(image-versions): locate image-versions.sh at its actual deployed path
- `013e8df0` docs(resume): add RESUME.md for context-restart recovery
- `f9fef8d2` docs(status): record rounds 3-5 + config migration + changelog as shipped
- `008da477` docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement
- `0ee16820` fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs
- `22052325` chore: retire .23 VPS mirror, promote .168 OVH to primary
- `f86d86c3` fix(install): kick scanner post-install so Launch button appears immediately
- `8cc84ebc` feat(install): phase-based progress bar replaces unparseable pull bytes
- `2d5b859e` feat(rpc): async-spawn install/uninstall/update lifecycle (Round 2)
- `0733ac40` fix(ui): shorten install/uninstall/update timeouts for async RPCs (Round 2)
- `e471ef75` fix(rpc): empty icon in transient install entry (Round 2)
|
||||
|
||||
**Deployed artifacts on .228**:
|
||||
- Backend: `/usr/local/bin/archipelago` md5 `9b8ead06aaf210b85cd78fce270384e3` (includes image-versions path fix)
|
||||
- Frontend: `/opt/archipelago/web-ui/` (v1.7.43-alpha changelog with 5 bullets, .168-only registry)
|
||||
- Rollback backups: `/usr/local/bin/archipelago.bak-pre-async-install` + `/opt/archipelago/web-ui.bak-pre-async-install/`
|
||||
|
||||
**Rollback command** (if catastrophic):
|
||||
```
ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
```
|
||||
> We are mid-migration: `docs/rust-orchestrator-migration.md` + `docs/bulletproof-containers.md` are the plan, Steps 1–7 + 8a are shipped on `main`, Step 8b is next. Read `docs/RESUME.md` + `docs/STEP-8B-PORT-AUDIT.md` in full. Do NOT run any container mutations or edit `scripts/container-specs.sh`, `scripts/first-boot-containers.sh`, or `scripts/reconcile-containers.sh` — those are dead code scheduled for deletion in Step 8c. Work happens in `core/container/src/manifest.rs`, `core/archipelago/src/container/prod_orchestrator.rs`, and `apps/<id>/manifest.yml`. Summarize back to me what you understand the current state to be, wait for approval before touching anything.
|
||||
|
||||
---
|
||||
|
||||
## Immediate next step
|
||||
## Standing directive from the user
|
||||
|
||||
**Phase 1 — browser verification of v1.7.43-alpha on https://192.168.1.228/**
|
||||
> Please get back to a well architected, minimal as possible, perfect working container architecture. If we've gone off track and the system is getting complex rather than elegant and perfect best containers ever then we need to review all the current state of the system and get back to making the best container system ever and according to our projects goals. We will be working on this until it's perfect.
|
||||
|
||||
1. Settings → About: top changelog entry reads "v1.7.43-alpha · Apr 23, 2026" with **5** bullets. Last bullet mentions "Update-available badges and version comparisons work again across every app." Hard-refresh (Cmd+Shift+R) if stale.
2. Settings → App Registries: only `146.59.87.168:3000/lfg2025` + `git.tx1138.com`. No .23.
3. Settings → System Update → Update Mirrors: only `.168` (Server 1 primary) + `tx1138` (Server 2). No .23.
4. Install SearXNG (small, fast image). Expect: instant button response, 7 phase labels in progress bar (Preparing → Pulling image → Creating container → Starting container → Waiting for healthy → Finalizing → Done), Launch button appears within ~3s of "Done".
5. Uninstall: snappy, no freeze.
|
||||
**Interpretation (validated with the user):** resume the Rust orchestrator migration. Stop patching bash scripts. The bash scripts were supposed to be deleted three months' worth of commits ago; we drifted into maintaining them by accident.
|
||||
|
||||
**Phase 2 — marketplace walk (app-by-app on .228)**
|
||||
## Latest user comment (must be followed)
|
||||
|
||||
Once Phase 1 is clean, user will install every app in the marketplace catalog one by one. Tracker: `docs/MARKETPLACE-QA.md`. For each broken app:
|
||||
- Triage via `journalctl -u archipelago`, `podman ps -a`, `podman logs <name>`.
- Identify layer: app recipe / registry image / backend / frontend.
- Fix, commit `fix(app/<name>): ...` or similar.
- Redeploy as needed.
- Append release-note bullet for the fix (to current in-flight version, or bump to v1.7.44-alpha if the pile grows).
- User re-verifies, mark ✅ in the tracker.
|
||||
> please continue, please state my last comment in the resume doc and first before making this plan to adhere to
|
||||
|
||||
Known pre-existing issue to expect: **Vaultwarden** container exits immediately on start. Backend correctly detects + removes state entry; needs container-config debug.
|
||||
Adherence rule for this session:
|
||||
- Before proposing or executing a plan, first record the user's latest directive in `docs/RESUME.md`.
- Keep work aligned to Step 8 migration goals and avoid off-scope drift.
|
||||
|
||||
Most recent directive:
|
||||
|
||||
> And we need to get every container working on .116 and tested before we release
|
||||
|
||||
Release gate update:
|
||||
- `.116` must have all required containers healthy and tested before release is allowed.
- Treat runtime stabilization on `.116` as immediate priority while continuing Step 8 migration work.
|
||||
|
||||
---
|
||||
|
||||
## Overall mission (unchanged)
|
||||
## Where we actually are
|
||||
|
||||
User mandate: _"best server containers in the world"_. Polish install/uninstall/update flows for all 6 bundled server containers + marketplace apps before release. Tackle UX issues one by one in order.
|
||||
### Shipped (Steps 1–7 + 8a)
|
||||
|
||||
Commits on `main` (unpushed to `origin`/tx1138 until release gate; user-visible history):
|
||||
|
||||
| Step | Commit | What |
|------|--------|------|
| 1 | (schema in place from earlier commits) | `ContainerConfig.image` ⊕ `ContainerConfig.build` — mutually exclusive pull-or-build source |
| 2 | `34af4d9d` | `ContainerRuntime` trait gains `image_exists` + `build_image`; `PodmanRuntime` impl |
| 3 | `b6a04d31` | `ProdContainerOrchestrator` with build-or-pull + adoption + reconcile |
| 4 | `e8a59c93` | `ContainerOrchestrator` trait; `RpcHandler` uses it in prod |
| 5 | `fc39b04b` | `BootReconciler` — periodic reconcile loop |
| 6 | `48f08aa3` | Wire both into `main.rs` |
| 7 | `069bc4a5` | `bitcoin-ui` pre-start hook renders `nginx.conf` from embedded template (the pattern for "derived config" at apply time) |
| 8a | `a0707f4d`, `1c81a739` | Retire `archipelago-reconcile` systemd timer; split Step 8 into 8a/8b/8c |
|
||||
|
||||
Three `apps/*/manifest.yml` are genuinely ported and running under the Rust orchestrator on `.116` + `.228`: `bitcoin-ui`, `electrs-ui`, `lnd-ui` (Step 7).
|
||||
|
||||
### Where we drifted (the session that produced the previous RESUME.md)
|
||||
|
||||
On 2026-04-23 a fedimint outage on `.116` pulled a session into patching `scripts/reconcile-containers.sh`, `scripts/container-specs.sh`, `scripts/first-boot-containers.sh` — files that Step 8c is scheduled to delete. Five bugs deep, the user halted the session. That cluster of bugs is a symptom of running two incompatible codepaths in parallel (bash first-boot/reconcile + Rust `BootReconciler`), which is exactly the condition Step 8c fixes by deleting the bash half.
|
||||
|
||||
**Discard-of-scope decision:** the uncommitted bash edits on `.116` (listed in the previous RESUME.md's "Uncommitted script changes" section) are not going to be committed. The fedimint mDNS-URLs fix, the filebrowser custom-args fix, the bcrypt-escape fix — these all land as changes to `apps/<id>/manifest.yml` + the Rust orchestrator in Steps 8b.0 – 8b.3. See `docs/STEP-8B-PORT-AUDIT.md` for the exact mapping.
|
||||
|
||||
### Current container state on `.116`
|
||||
|
||||
Running but drifted. See the "Current container state" section in the previous RESUME.md. Decision (approved by user): accept `.116` is limping until 8b.3 lands. Do not run `scripts/reconcile-containers.sh` or any mutations; all rescues go through the Rust orchestrator or wait for the manifest port.
|
||||
|
||||
`.228` is happier — it's already adopted by the Rust orchestrator for the three UI apps.
|
||||
|
||||
---
|
||||
|
||||
## Working layout — SSH + SSHFS
|
||||
## Next step — Step 8b.0
|
||||
|
||||
- SSHFS mount: `/Users/dorian/mnt/archy-thinkpad/` → `archy:Projects/archy/`. Use for all file ops (read/edit/write/glob/grep).
|
||||
- Direct SSH: `ssh archy` (= `archipelago@192.168.1.116`, ThinkPad dev). Use for git/cargo/npm/systemctl.
|
||||
- Demo node: `ssh archy228` (= `archipelago@192.168.1.228`). NOPASSWD sudo. Dashboard login pw: `password123`.
|
||||
- Sudo pw on .116: `ThisIsWeb54321@`. Fallback sudo pw on .228: `archipelago`.
|
||||
- Cargo: `~/.cargo/bin/cargo` on .116. Long builds: `nohup ... & disown` to `/tmp/cargo-build-*.log`.
|
||||
- SSHFS flake: `write` sometimes returns `NotFound: FileSystem.readFile` on new files — retry once.
|
||||
**Concretely:** schema extensions to `core/container/src/manifest.rs` + unit tests. No orchestrator changes, no manifest changes, no container mutations.
|
||||
|
||||
## Deploy recipes
|
||||
Fields to add (justified in `docs/STEP-8B-PORT-AUDIT.md`, section "Schema gaps"):
|
||||
|
||||
**Backend binary** (can't cp while running — "Text file busy"; binary ferries via Mac because .116 can't resolve archy228):
|
||||
```
# On .116:
~/.cargo/bin/cargo build --release # ~3.5 min
# From Mac:
scp archy:Projects/archy/core/target/release/archipelago /tmp/archipelago-new
scp /tmp/archipelago-new archy228:/tmp/archipelago-new
ssh archy228 'sudo systemctl stop archipelago && sudo cp /tmp/archipelago-new /usr/local/bin/archipelago && sudo systemctl start archipelago && sudo systemctl reload nginx'
```
|
||||
- `container.network: Option<String>` — podman `--network` value (`"archy-net"`, `"host"`, or `None` = isolated default).
- `container.custom_args: Vec<String>` — appended to the container command.
- `container.entrypoint: Option<Vec<String>>` — override.
- `container.derived_env: Vec<{key, template}>` — template strings resolved against `HostFacts { host_ip, host_mdns, disk_gb }` at apply time.
- `container.secret_env: Vec<{key, secret_file}>` — read from `/var/lib/archipelago/secrets/<file>` at apply time.
- `container.data_uid: Option<String>` — `"NNNNN:NNNNN"` applied via `chown -R` before container create.
- `Volume.volume_type: "tmpfs"` + `Volume.tmpfs_options: String` — OR a new `container.tmpfs: Vec<{target, options}>`. Pick one at implementation time.
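
A minimal sketch of how `derived_env` templates might be resolved against `HostFacts` at apply time. The `{{name}}` placeholder syntax, the `DerivedEnv` struct, and the free functions `render` and `resolve_env` are assumptions made for illustration; the audit doc, not this sketch, defines the real schema.

```rust
use std::collections::BTreeMap;

/// Host facts available to templates (field names taken from the list above).
#[derive(Debug, Clone)]
pub struct HostFacts {
    pub host_ip: String,
    pub host_mdns: String,
    pub disk_gb: u64,
}

/// One `derived_env` entry: an env key plus a template string (hypothetical shape).
#[derive(Debug, Clone)]
pub struct DerivedEnv {
    pub key: String,
    pub template: String,
}

impl HostFacts {
    /// Look up a placeholder by name; `None` means the fact is unknown.
    fn lookup(&self, name: &str) -> Option<String> {
        match name {
            "host_ip" => Some(self.host_ip.clone()),
            "host_mdns" => Some(self.host_mdns.clone()),
            "disk_gb" => Some(self.disk_gb.to_string()),
            _ => None,
        }
    }
}

/// Expand every `{{name}}` placeholder (assumed syntax) in `template`.
pub fn render(template: &str, facts: &HostFacts) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find("}}")
            .ok_or_else(|| format!("unterminated placeholder in {:?}", template))?;
        let name = after[..end].trim();
        let value = facts
            .lookup(name)
            .ok_or_else(|| format!("unknown host fact {:?}", name))?;
        out.push_str(&value);
        rest = &after[end + 2..];
    }
    out.push_str(rest);
    Ok(out)
}

/// Resolve all entries into a key -> value map, failing on the first bad template.
pub fn resolve_env(
    entries: &[DerivedEnv],
    facts: &HostFacts,
) -> Result<BTreeMap<String, String>, String> {
    let mut env = BTreeMap::new();
    for e in entries {
        env.insert(e.key.clone(), render(&e.template, facts)?);
    }
    Ok(env)
}
```

Because resolution runs on every apply, a changed host IP or mDNS name flows into the container environment on the next reconcile instead of being baked in at install time.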
|
||||
|
||||
**Frontend** (rsyncs via Mac):
|
||||
```
ssh archy 'cd ~/Projects/archy/neode-ui && npm run build' # outputs to ../web/dist/neode-ui/
rsync -az --delete archy:Projects/archy/web/dist/neode-ui/ /tmp/archy-web/
rsync -az /tmp/archy-web/ archy228:/tmp/archy-web/
ssh archy228 'sudo rsync -a --delete /tmp/archy-web/ /opt/archipelago/web-ui/ && sudo systemctl reload nginx'
```
|
||||
**Tests** (block the commit until green):
|
||||
|
||||
Note: frontend source is `neode-ui/` (has package.json). `web/` has no package.json; `web/dist/neode-ui/` is the build output.
|
||||
- Every existing `apps/*/manifest.yml` still parses (`parse_every_real_manifest` test).
- Each new field parses correctly with sensible defaults.
- `validate()` rejects: empty custom_args elements, empty entrypoint elements, duplicate derived_env keys, derived_env templates referencing unknown host facts, secret_env with `..` or `/` in secret_file (path-traversal guard).
- `resolve_env(HostFacts)` returns expected strings for each supported placeholder.
- `resolve_secret_env(SecretsProvider)` returns expected strings; missing secret file is a hard error.
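
The validation rules above could look roughly like the following. This is a sketch only: `ContainerFields` is a hypothetical, flattened stand-in for the real `ContainerConfig`, the error type is a plain `String`, and the unknown-host-fact check is left out because it depends on the template syntax chosen.

```rust
use std::collections::HashSet;

/// Hypothetical, flattened view of the new fields that `validate()` inspects.
#[derive(Debug, Default)]
pub struct ContainerFields {
    pub custom_args: Vec<String>,
    pub entrypoint: Option<Vec<String>>,
    pub derived_env_keys: Vec<String>,
    pub secret_files: Vec<String>,
}

pub fn validate(c: &ContainerFields) -> Result<(), String> {
    // Empty argv elements are almost always a YAML typo; reject them early.
    if c.custom_args.iter().any(|a| a.is_empty()) {
        return Err("custom_args contains an empty element".to_string());
    }
    if let Some(ep) = &c.entrypoint {
        if ep.iter().any(|a| a.is_empty()) {
            return Err("entrypoint contains an empty element".to_string());
        }
    }

    // Duplicate keys would make the resolved environment order-dependent.
    let mut seen = HashSet::new();
    for key in &c.derived_env_keys {
        if !seen.insert(key.as_str()) {
            return Err(format!("duplicate derived_env key {:?}", key));
        }
    }

    // Path-traversal guard: secret_file must be a bare file name, so it can
    // only ever resolve inside the secrets directory.
    for file in &c.secret_files {
        if file.is_empty() || file.contains("..") || file.contains('/') {
            return Err(format!("secret_file {:?} must be a bare file name", file));
        }
    }
    Ok(())
}
```

Rejecting `/` outright, rather than canonicalizing and comparing prefixes, keeps the guard independent of the filesystem state at validation time.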
|
||||
|
||||
## Commit protocol
|
||||
|
||||
- Never push. User mirrors to Gitea remotes manually.
- Conventional Commits. No em-dashes or fancy punctuation.
- Multi-line messages via `tmp-commit-msg.txt`: `git commit -F tmp-commit-msg.txt && rm tmp-commit-msg.txt`.
- Git remotes on .116: `gitea-local`, `gitea-vps2` (.168 OVH), `tx1138` (canonical), `origin` (multi-push alias). `.23` URLs were removed from origin and `gitea-vps` remote was deleted — working-copy change, not in any commit.
|
||||
|
||||
## Verification gates
|
||||
|
||||
1. `cargo check`
2. `cargo test -p archipelago --bin archipelago <filter>` (MUST use `--bin archipelago`; no lib target)
3. `cargo build --release`
|
||||
|
||||
Known issues:
|
||||
- `rust-lld: undefined hidden symbol` → cargo bug with test+release incremental collision. Fix: `rm -rf core/target/debug/incremental` and retry.
- 22 pre-existing `cargo test` failures in unrelated modules (mesh/wallet/credentials/avatar/session/transport/update-mirrors/fips/identity_manager/image_versions). Not blocking. Tech debt.
|
||||
This is the smallest useful commit and unblocks every port in 8b.1+.
|
||||
|
||||
---
|
||||
|
||||
## Architecture — locked-in patterns
|
||||
## Project ground rules (standing)
|
||||
|
||||
### Async-spawn lifecycle (install/update/uninstall/start/stop/restart)
|
||||
|
||||
- RPC returns `{status, package_id}` immediately (15s client timeout).
- Wrapper flips state to transitional variant (Installing/Updating/Removing/Stopping/Starting/Restarting) BEFORE spawn.
- `tokio::spawn` runs existing monolithic inner handler with `self: Arc<Self>`.
- Install/update success: MUST explicitly write terminal Running state. `merge_preserving_transitional` in `server.rs` refuses to let scanner overwrite transitional states.
- Uninstall success: inner handler removes the entry itself.
- On error: revert pre-transition state (or remove entry for install).
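
The merge rule in the list above can be illustrated with a small std-only sketch. The names `merge_preserving_transitional` and `is_transitional` come from this document, but the signatures, the `PackageEntry` shape, the `Stopped` variant, and the handling of entries missing from a scan are assumptions, not the code in `server.rs`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Installing,
    Updating,
    Removing,
    Stopping,
    Starting,
    Restarting,
    Running,
    Stopped, // assumed terminal state, for illustration
}

/// True for states owned by an in-flight lifecycle task.
pub fn is_transitional(s: PackageState) -> bool {
    matches!(
        s,
        PackageState::Installing
            | PackageState::Updating
            | PackageState::Removing
            | PackageState::Stopping
            | PackageState::Starting
            | PackageState::Restarting
    )
}

#[derive(Debug, Clone, PartialEq)]
pub struct PackageEntry {
    pub state: PackageState,
    pub manifest: String, // stand-in for the real manifest type
}

/// Merge a fresh scan into the current view without clobbering transitional states.
pub fn merge_preserving_transitional(
    current: &HashMap<String, PackageEntry>,
    scanned: HashMap<String, PackageEntry>,
) -> HashMap<String, PackageEntry> {
    let mut merged = scanned;
    for (id, cur) in current {
        if !is_transitional(cur.state) {
            continue; // the scanner is the source of truth for settled packages
        }
        match merged.get_mut(id) {
            // Keep the transitional state but take the fresh manifest.
            Some(fresh) => fresh.state = cur.state,
            // Container not visible to the scanner yet: keep the entry as-is.
            None => {
                merged.insert(id.clone(), cur.clone());
            }
        }
    }
    merged
}
```

This is why the success path has to write `Running` explicitly: under this rule the scanner alone can never move a package out of a transitional state.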
|
||||
|
||||
Key files:
|
||||
- `core/archipelago/src/api/rpc/package/async_lifecycle.rs` — full install/update/uninstall wrappers
- `core/archipelago/src/api/rpc/transitional.rs` — start/stop/restart wrappers
- `core/archipelago/src/server.rs:832-871` — `merge_preserving_transitional`, `is_transitional`
- `core/archipelago/src/server.rs:295-380` — scan loop with `tokio::select!` and tick bump
|
||||
|
||||
### Install progress (phase-based, 7 levels)
|
||||
|
||||
- `podman pull` emits zero parseable progress when stderr is piped (no TTY). Legacy byte regex never matched.
- Phases + UI %: Preparing (5) → PullingImage (20) → CreatingContainer (70) → StartingContainer (80) → WaitingHealthy (88) → PostInstall (95) → Done (100).
- UI bar only advances forward (`Math.max`).
- Final phase label is "Finalizing…" (renamed from "Running post-install…" which confused users).
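
The phase table and the forward-only rule can be sketched as follows. The phase names and percentages are the ones listed above; the `InstallPhase` enum, the `ProgressBar` type, and its `observe` method are hypothetical, and the real mapping lives in the TypeScript `PHASE_INFO` table rather than in Rust.

```rust
/// Install phases in reporting order (names from the list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallPhase {
    Preparing,
    PullingImage,
    CreatingContainer,
    StartingContainer,
    WaitingHealthy,
    PostInstall,
    Done,
}

impl InstallPhase {
    /// Fixed UI percentage for each phase.
    pub fn percent(self) -> u8 {
        match self {
            InstallPhase::Preparing => 5,
            InstallPhase::PullingImage => 20,
            InstallPhase::CreatingContainer => 70,
            InstallPhase::StartingContainer => 80,
            InstallPhase::WaitingHealthy => 88,
            InstallPhase::PostInstall => 95,
            InstallPhase::Done => 100,
        }
    }
}

/// Progress bar that never moves backwards, mirroring the `Math.max` rule.
#[derive(Debug, Default)]
pub struct ProgressBar {
    percent: u8,
}

impl ProgressBar {
    /// Record a phase update and return the percentage to display.
    pub fn observe(&mut self, phase: InstallPhase) -> u8 {
        self.percent = self.percent.max(phase.percent());
        self.percent
    }
}
```

Taking the maximum means a late or out-of-order phase event cannot make the bar jump backwards.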
|
||||
|
||||
Key files:
|
||||
- `neode-ui/src/stores/server.ts:25-33` — `PHASE_INFO` mapper
|
||||
|
||||
### Scanner kick (instant Launch button)
|
||||
|
||||
- Scan runs every 60s. Post-install state flipped to Running but skeletal manifest (`interfaces: None`) persisted until next scan → `canLaunch(pkg)` false for up to 60s.
- `lan_address` derived from live container port bindings. `manifest.interfaces.main.ui` only populated when `lan_address.is_some() || tor_address.is_some()`.
- Fix: `scan_kick: Arc<Notify>` + `scan_tick: Arc<watch::Sender<u64>>` on `RpcHandler`. Scan loop `tokio::select!` between 60s tick + notify. `kick_scanner_and_wait` helper (2s timeout) called in install/update success paths BEFORE writing Running. Merge during Installing keeps state + takes fresh manifest.
|
||||
|
||||
Key files:
|
||||
- `core/archipelago/src/api/rpc/mod.rs:89-93` — fields on RpcHandler; accessors :186-199
- `core/archipelago/src/api/rpc/package/async_lifecycle.rs:405-430` — `kick_scanner_and_wait`
- `core/archipelago/src/container/docker_packages.rs:132-218` — where `lan_address` + manifest get populated
- `neode-ui/src/views/apps/appsConfig.ts:106-111` — `canLaunch(pkg)`
- `neode-ui/src/views/apps/AppCard.vue:141-149` — Launch button render
|
||||
|
||||
### Config migration (.23 auto-purge)
|
||||
|
||||
- `load_mirrors` + `load_registries` normally only ADD missing defaults ("explicit removals stick").
- .23 was a default the user never chose, so we need the opposite: strip it.
- `.retain(|m| !m.url.contains("23.182.128.160"))` before defaults-merge step. Narrow-scope exception, commented in-code.
- Triggers lazily on next load (install RPC, update RPC, Settings UI open). Not tied to boot.
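
The purge-then-merge order described above might look like this in isolation. The `.retain(...)` predicate is quoted from this document; the `Mirror` struct, the `migrate_mirrors` function, and the simple "add any default not already present" merge are assumptions. How explicit user removals are remembered is not modelled here.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Mirror {
    pub url: String,
}

/// Address of the decommissioned .23 VPS (from the bullet above).
const DECOMMISSIONED_HOST: &str = "23.182.128.160";

/// Strip the retired host from a saved config, then add missing defaults.
pub fn migrate_mirrors(mut saved: Vec<Mirror>, defaults: &[Mirror]) -> Vec<Mirror> {
    // Narrow exception to "explicit removals stick": the user never chose
    // this entry, so it is purged rather than preserved.
    saved.retain(|m| !m.url.contains(DECOMMISSIONED_HOST));

    // Defaults-merge step (assumed shape): append defaults that are absent.
    for d in defaults {
        if !saved.iter().any(|m| m.url == d.url) {
            saved.push(d.clone());
        }
    }
    saved
}
```

Running the purge before the merge matters only if the retired host could still appear in the defaults; doing it first keeps the function correct either way.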
|
||||
|
||||
Key files:
|
||||
- `core/archipelago/src/container/registry.rs` — `load_registries`
- `core/archipelago/src/update.rs` — `load_mirrors`
|
||||
- `archy` SSH alias = `.116`. `archy228` = `.228`. **Do not swap.**
- SSHFS at `/Users/dorian/mnt/archy-thinkpad/` = `archy:Projects/archy/`.
- `.116` sudo password: `ThisIsWeb54321@` — works passwordless in-session via `sudo -nS` after first use.
- `.228` has NOPASSWD.
- Git commits on `.116` MUST use `git commit -F /tmp/tmp-msg.txt` over `ssh archy` — SSHFS `git commit` hangs.
- Never push except current release (granted: `gitea-local` + `gitea-vps2`).
- No em-dashes. Conventional Commits.
- No altcoin mentions, Bitcoin-only.
|
||||
|
||||
---
|
||||
|
||||
## Backlog — after v1.7.43 verification
|
||||
## Recommended next action for the fresh session
|
||||
|
||||
1. User reports browser-verification results. Fix anything that fails.
2. Continue user's "one by one" install/uninstall/update UX queue — ask for next issue.
3. Tech debt (low priority, not blocking release):
   - Vaultwarden container exits immediately on start (separate container-config issue).
   - 22 pre-existing cargo test failures in unrelated modules.
   - "Server 3 (OVH)" historical changelog entries in `AccountInfoSection.vue` left intact (user approved — they're release notes for what shipped at the time).
|
||||
1. Read this file + `docs/STEP-8B-PORT-AUDIT.md` + the "Open decisions" section of the audit.
2. Answer the four open decisions (or confirm the recommended defaults).
3. Implement 8b.0 commit 1: add `network`, `custom_args`, `entrypoint`, `derived_env`, `secret_env`, `data_uid` fields to `ContainerConfig` + validation + unit tests. Backwards-compat: every existing `apps/*/manifest.yml` must still parse.
4. Run `cargo test -p archipelago-container`, commit once green, then stop.
|
||||
|
||||
Do not touch `scripts/*.sh`. Do not run `reconcile-containers.sh`. Do not live-test on `.116` or `.228` until the schema + orchestrator pieces in 8b.0 + 8b.1 are both in.
|
||||
|
||||
---
|
||||
|
||||
## User preferences (must follow)
|
||||
## Recent release (out of scope, for grep context)
|
||||
|
||||
- Always state which option is "best long-term" first and explain why in plain terms. Trust my recommendation unless overridden.
- "Tackle them one by one in order" — fix issues sequentially, not in a big bang.
- Bitcoin-only. No altcoins, no proprietary deps without approval.
- Prefer established OSS, crypto-first libs (rustls, argon2, ed25519), privacy-focused (no telemetry), minimal dep trees.
- Atomic commits.
- Never commit secrets. Pin dependency versions.
- Never push — user mirrors to Gitea manually.
|
||||
v1.7.43-alpha shipped yesterday: tarball-only OTA, async install/uninstall/update lifecycle, install UX polish, `.23` VPS retirement. Manifest at `gitea-local` + `gitea-vps2`. `.228` on the new binary. See `docs/STATUS.md` for the full rundown.
|
||||
|
||||
Earlier session notes (container rescue on `.116`, "never fails" directive, env-drift detector experiment) are obsolete — superseded by this file. The directive ("never fails") is honored by the Step 8 migration itself: a declarative manifest regenerated on every reconcile tick can't bake stale IPs into consensus data because the env comes from derived/secret sources that are re-resolved every apply.
|
||||
|
||||