Overnight Plan — Production Excellence
Systematically fix every production-readiness issue across the Archipelago codebase. Each task is self-contained and behavior-preserving. The full issue registry and architectural context are in `.claude/plans/plan.md`. CRITICAL: Deploy ONLY to .198 (`ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198`). Never deploy to .228. Follow all rules in `CLAUDE.md`. Make atomic commits with a `fix:` or `refactor:` prefix.
Phase 1: P0 Backend — Hangs, Data Loss, Missing Handlers
-
R1 — Add health RPC endpoint handler: In
`core/archipelago/src/api/rpc/mod.rs`, the `"health"` method is listed in `UNAUTHENTICATED_METHODS` (line 113) but has NO match arm in the dispatcher, so it returns an "Unknown method" error. Add a handler that returns JSON with: `{"status": "ok", "crash_recovery_complete": bool, "uptime_seconds": u64, "version": "..."}`. Check whether crash recovery is done via the server state. Return `"degraded"` status if recovery is still in progress. Test by running `curl -s -X POST http://192.168.1.198/rpc/v1 -H 'Content-Type: application/json' -d '{"method":"health"}'` and verifying real status JSON (not an error). Run `cargo clippy --all-targets --all-features` and `cargo test --all-features` on the dev server after changes. -
R2 — Add timeout to Nostr client.connect(): In
`core/archipelago/src/nostr_handshake.rs`, there are 4 calls to `client.connect().await` with NO timeout at lines 124, 161, 262, and 282. If a Nostr relay is down, these hang forever. Wrap each one in `tokio::time::timeout(Duration::from_secs(10), client.connect()).await`. Handle the timeout error by logging a warning and continuing (Nostr is best-effort). Note that `fetch_events()` already has timeouts (lines 168, 370); match that pattern. Run `cargo clippy` and `cargo test` after. -
R3 — Make backup restore atomic with rollback: In
`core/archipelago/src/backup/full.rs` lines 122-149, `restore_full_backup()` extracts the tar directly to the live data directory. If extraction fails halfway, the system is left in a corrupt partial state. Fix by: (1) Check disk space before starting: at least 2x the backup size must be free. (2) Extract to a staging directory (`data_dir.join(".restore-staging")`). (3) Validate the staging dir has required files (`identity/`, `sessions.json` at minimum). (4) Rename current data_dir contents to `.restore-backup`. (5) Move staging contents to data_dir. (6) On any failure, restore from `.restore-backup`. (7) Clean up staging/backup dirs on success. Use `tokio::fs` for all operations. Run `cargo test` after. -
I1 — Protect unauthenticated nginx endpoints: In
`image-recipe/configs/nginx-archipelago.conf`, the `/archipelago/` location block (lines 116-121), `/content` (line 166+), and `/dwn` (line 176+) have NO timeout, rate-limit, or body size protection. Add to each of these three location blocks: `limit_req zone=rpc burst=20 nodelay;`, `client_max_body_size 10m;`, `proxy_connect_timeout 30s;`, `proxy_read_timeout 60s;`, `proxy_send_timeout 30s;`. If a `limit_req_zone` named `rpc` doesn't exist, check the existing zones at the top of the config and either use an existing one or add `limit_req_zone $binary_remote_addr zone=peer:10m rate=10r/s;` in the http block and reference `zone=peer`. Verify syntax with `nginx -t` on .198 after deploying. -
Phase 1 verification gate: SSH to .198 and run:
`cargo clippy --all-targets --all-features` (zero warnings), `cargo test --all-features` (all pass). Deploy with `./scripts/deploy-to-target.sh --target 192.168.1.198`. Then run `curl http://192.168.1.198/health` and verify it returns real JSON status. Run `curl -X POST http://192.168.1.198/rpc/v1 -H 'Content-Type: application/json' -d '{"method":"health"}'` and verify a JSON response with a status field.
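The "real JSON status, not an error" check in this gate can also be made mechanical on the client side. The TypeScript sketch below assumes the payload carries exactly the four fields listed in R1; `HealthResponse` and `isHealthResponse` are hypothetical names, and any RPC envelope wrapped around the payload is not modeled here.

```typescript
// Shape of the health payload as described in R1 (assumed, not confirmed
// against the actual handler).
interface HealthResponse {
  status: "ok" | "degraded";
  crash_recovery_complete: boolean;
  uptime_seconds: number;
  version: string;
}

// Type guard: returns true only when every expected field is present
// with the expected type.
function isHealthResponse(value: unknown): value is HealthResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const uptime = v.uptime_seconds;
  return (
    (v.status === "ok" || v.status === "degraded") &&
    typeof v.crash_recovery_complete === "boolean" &&
    typeof uptime === "number" &&
    isFinite(uptime) &&
    uptime >= 0 &&
    typeof v.version === "string"
  );
}
```

An error body such as `{ "error": "Unknown method" }` fails the guard, which is the pre-fix behavior that R1 describes.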
Phase 2: P0 Frontend — Race Conditions and Silent Failures
-
F1 — Fix WebSocket subscription race condition: In
`neode-ui/src/stores/app.ts` lines 88-134, `connectWebSocket()` uses a local `isWsSubscribed` flag to prevent double-subscription, but two rapid calls can both pass the check before either sets it. Fix: (1) Move `isWsSubscribed` to a module-level `let` variable initialized to `false` (if not already). (2) Add an early return if the WebSocket is already connecting: `if (wsClient.isConnecting()) return;`. (3) Before subscribing, call `wsClient.unsubscribeAll()` to clear any prior callbacks, THEN subscribe fresh. This ensures exactly one callback is active regardless of how many times `connectWebSocket()` is called. Run `cd neode-ui && npm run type-check` after. -
F2 — Protect mesh store concurrent mutations: In
`neode-ui/src/stores/mesh.ts`, `sendMessage()` (line 249), `sendInvoice()`, and `sendCoordinate()` all call `fetchMessages()` after sending, but multiple concurrent calls can race. Fix: Add a `const sendQueue = ref<Promise<void>>(Promise.resolve())` at module level. Each send function chains onto it: `sendQueue.value = sendQueue.value.then(() => doSend(...))`. This serializes sends so `fetchMessages()` is never called concurrently. Run `npm run type-check` after. -
F3 — Add global Vue error handler: In
`neode-ui/src/main.ts`, there is no `app.config.errorHandler`. Any component error causes a white screen. After `const app = createApp(App)`, add: `app.config.errorHandler = (err, instance, info) => { console.error('[Vue Error]', err, info); const { showError } = useToast(); showError('Something went wrong. Please refresh the page.'); };`. Import `useToast` from `@/composables/useToast`. Run `npm run type-check` after. -
S1 — Eliminate all sudo podman in scripts: Run
`grep -rn 'sudo podman' scripts/ indeedhub/ image-recipe/` to find all instances. In each script, replace `sudo podman` with `podman`. The main offenders are: `scripts/fix-indeedhub-containers.sh` (28 instances), `scripts/deploy-bitcoin-knots.sh` (11 instances), `scripts/deploy-tailscale.sh` (check for any remaining), `scripts/uptime-monitor.sh`, `scripts/setup-aiui-server.sh`. After replacing, verify with `grep -rn 'sudo podman' scripts/ indeedhub/ image-recipe/`, which should return zero results. Do NOT change `docs/` files (those are historical records). -
S2 — Add health checks to all containers in first-boot: In
`scripts/first-boot-containers.sh`, every `$DOCKER run` command needs `--health-cmd`, `--health-interval=30s`, `--health-timeout=5s`, `--health-retries=3`. Use appropriate health commands: For Bitcoin Knots (line ~253): `--health-cmd="bitcoin-cli -rpcuser=\$BITCOIN_RPC_USER -rpcpassword=\$BITCOIN_RPC_PASS getblockchaininfo || exit 1"`. For HTTP apps (Mempool, BTCPay, Grafana, etc.): `--health-cmd="curl -sf http://localhost:{PORT}/ || exit 1"`. For LND: `--health-cmd="curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1"`. For databases (MariaDB): `--health-cmd="mariadb -uroot -e 'SELECT 1' || exit 1"`. For ElectrumX: `--health-cmd="curl -sf http://localhost:50002/ || exit 1"`. After editing, verify with `grep -c 'health-cmd' scripts/first-boot-containers.sh`; the count should match the number of `$DOCKER run` commands. -
Phase 2 verification gate: Run
`cd neode-ui && npm run type-check` (zero errors). Run `cd neode-ui && npm test` (all pass). Run `grep -rn 'sudo podman' scripts/ indeedhub/ image-recipe/ | grep -v docs/ | grep -v '#'` and verify zero results. Run `grep -c 'health-cmd' scripts/first-boot-containers.sh` and verify the count matches the number of container run commands.
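The send serialization described in F2 can be sketched as a small promise-chain helper. This is a minimal illustration, not the store's actual code: `createSerialQueue` is a hypothetical name, and a plain variable stands in for the Vue `ref` so the snippet has no dependencies. One deliberate difference from a bare `.then()` chain: a rejected send is swallowed on the internal tail, so later sends still run while the caller still sees the rejection.

```typescript
type Task<T> = () => Promise<T>;

// Hypothetical helper: returns an enqueue function that runs async tasks
// strictly one after another, in the order they were enqueued.
function createSerialQueue(): <T>(task: Task<T>) => Promise<T> {
  // The tail never rejects, so a failed task cannot block later ones.
  let tail: Promise<unknown> = Promise.resolve();

  return function enqueue<T>(task: Task<T>): Promise<T> {
    // Start this task only after everything queued before it has settled.
    const result = tail.then(task);
    // Swallow the error on the internal chain only; `result` still
    // rejects for the caller.
    tail = result.catch(() => undefined);
    return result;
  };
}
```

In the mesh store, each of `sendMessage()`, `sendInvoice()`, and `sendCoordinate()` would wrap its body (the send plus the follow-up `fetchMessages()`) in one `enqueue(...)` call, assuming all three share a single queue instance.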
Phase 3: P1 Backend — Blocking I/O in Async Context
-
R6 — Fix session.rs blocking I/O (6 calls): In
`core/archipelago/src/session.rs`, replace: (1) Line 77: `std::fs::read_to_string()` → `tokio::fs::read_to_string().await`. (2) Line 128: `std::fs::write()` → `tokio::fs::write().await`. (3) Line 370: `std::fs::read()` → `tokio::fs::read().await`. (4) Line 413: `std::fs::read()` → `tokio::fs::read().await`. (5) Line 423: `std::fs::create_dir_all()` → `tokio::fs::create_dir_all().await`. (6) Line 425: `std::fs::write()` → `tokio::fs::write().await`. Make sure the containing functions are `async fn`. Add `use tokio::fs;` at top if not present. Run `cargo clippy` and `cargo test` after. -
R7 — Fix docker_packages.rs blocking I/O: In
`core/archipelago/src/container/docker_packages.rs`, replace `std::fs::read_to_string()` at lines 561 and 573 with `tokio::fs::read_to_string().await`. Ensure the containing function is async. Run `cargo test` after. -
R8 — Fix port_allocator.rs blocking I/O: In
`core/archipelago/src/port_allocator.rs`, replace: (1) Line 59: `std::fs::read_to_string()` → `tokio::fs::read_to_string().await`. (2) Line 73: `std::fs::create_dir_all()` → `tokio::fs::create_dir_all().await`. (3) Line 77: `std::fs::write()` → `tokio::fs::write().await`. Run `cargo test` after. -
R9+R10+R11 — Fix remaining blocking I/O across 5 files: (1)
`core/archipelago/src/peers.rs` line 30: `fs::read_to_string()` → `tokio::fs::read_to_string().await`. (2) `core/archipelago/src/node_message.rs` line 65: `std::fs::write()` → `tokio::fs::write().await`. (3) `core/archipelago/src/identity.rs` line 50: `fs::set_permissions()` → `tokio::fs::set_permissions().await`. (4) `core/archipelago/src/identity_manager.rs` line 164: `fs::set_permissions()` → `tokio::fs::set_permissions().await`. (5) `core/archipelago/src/nostr_discovery.rs` line 55: `std::fs::set_permissions()` → `tokio::fs::set_permissions().await`. Run `cargo clippy` and `cargo test` after all changes. -
R12 — Fix electrs_status.rs sync TCP I/O: In
`core/archipelago/src/electrs_status.rs`, the entire module uses synchronous TCP I/O (`std::net::TcpStream`, `BufReader`, `write_all`). Convert to async using `tokio::net::TcpStream` and `tokio::io::{AsyncBufReadExt, AsyncWriteExt}`. Replace `std::fs::read_dir()` at line 40 with `tokio::fs::read_dir().await`. Wrap the TCP connection in a `tokio::time::timeout(Duration::from_secs(5), ...)` to prevent hangs if ElectrumX is down. Run `cargo clippy` and `cargo test` after. -
R4+R5 — Spawn rate limiter cleanup tasks: In
`core/archipelago/src/session.rs`, the `EndpointRateLimiter::cleanup()` method (lines 566-579) and `LoginRateLimiter` cleanup exist but are never called. In the `RpcHandler::new()` function (or wherever the rate limiters are constructed), spawn a background task: `let limiter = endpoint_rate_limiter.clone(); tokio::spawn(async move { let mut interval = tokio::time::interval(Duration::from_secs(300)); loop { interval.tick().await; limiter.cleanup().await; } });`. Do the same for `LoginRateLimiter`. Run `cargo test` after. -
Phase 3 verification gate: Run on .198:
`cargo clippy --all-targets --all-features` (zero warnings), `cargo test --all-features` (all pass). Search for remaining blocking I/O: `grep -rn 'std::fs::' core/archipelago/src/ --include='*.rs' | grep -v test | grep -v target`, which should return minimal results (only in non-async contexts or test code). Deploy to .198 and verify health.
Phase 4: P1 Frontend — Memory Leaks and Stale State
-
F4 — WebSocket reconnect full state refresh: In
`neode-ui/src/stores/app.ts`, after `wsClient.connect()` succeeds in `connectWebSocket()`, immediately call `const freshState = await rpcClient.call({ method: 'server.get-state' })` and set `data.value = freshState.data` to get fresh state. This ensures no stale patches are applied to an outdated base state after a disconnect. Run `npm run type-check` after. -
F5 — Fix message polling timer lifecycle: In
`neode-ui/src/composables/useMessageToast.ts`, the `pollTimer` (`setInterval` at line 60) is module-level and never cleaned up on logout. Fix: (1) In `startPolling()`, check if auth is still valid before polling. (2) Ensure `stopPolling()` is called on logout. (3) In `neode-ui/src/App.vue`, find where `startPolling` is called and add `stopPolling()` to the logout/auth-change handler. (4) Add a `watch` on the auth state: when it becomes false, call `stopPolling()`. Run `npm run type-check` and `npm test` after. -
F6 — Fix AppLauncher NIP-07 listener leak: In
`neode-ui/src/stores/appLauncher.ts` lines 295-301, the `handleNostrRequest` listener is added on `isOpen=true` and removed on `isOpen=false`. But if the user navigates away (route change) without closing the overlay, the listener persists. Fix: In the `close()` function, explicitly call `window.removeEventListener('message', handleNostrRequest)`. Also add a router `beforeEach` guard or use `onBeforeUnmount` in the component that uses this store to call `close()`. Run `npm run type-check` after. -
F7 — Fix audio player listener stacking: In
`neode-ui/src/composables/useAudioPlayer.ts`, the `play()` function creates a new `Audio()` element and adds 6 event listeners every time it's called (if `audio.value` is null). But since `audio` is a module-level ref, it persists across calls; the issue is that listeners are never removed. Fix: (1) Create the Audio element and listeners once in an `init()` function. (2) Use a `let initialized = false` flag to prevent re-initialization. (3) In `play()`, just set `audio.value.src` and call `audio.value.play()`. Run `npm run type-check` after. -
S3 — Pin all container images — remove :latest: Across all scripts, replace every
`:latest` tag with a specific version. Create `scripts/image-versions.env` as the single source of truth: `BITCOIN_KNOTS_IMAGE="docker.io/bitcoinknots/bitcoin:v28.1"`, `SEARXNG_IMAGE="docker.io/searxng/searxng:2024.11.17"`, `PHOTOPRISM_IMAGE="docker.io/photoprism/photoprism:240915"`, etc. Source this file from `first-boot-containers.sh`, `deploy-to-target.sh`, `deploy-tailscale.sh`, and `build-auto-installer-iso.sh`. For custom/local images (lnd-ui, electrs-ui, bitcoin-ui, indeedhub), use `localhost/{name}:$(git rev-parse --short HEAD)` or a date-based tag instead of `:latest`. Verify with `grep -rn ':latest' scripts/ image-recipe/ | grep -v node_modules | grep -v '#' | grep -v '.md'`, which should return zero results. -
Phase 4 verification gate: Run
`cd neode-ui && npm run type-check` (zero errors). Run `cd neode-ui && npm test` (all pass). Run `grep -rn ':latest' scripts/ image-recipe/ | grep -v node_modules | grep -v '#' | grep -v '.md'` and verify zero results. Deploy to .198 and verify WebSocket reconnection works (kill backend, wait, restart, check UI recovers with fresh data).
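The timer lifecycle from F5 could look roughly like the sketch below. The real `useMessageToast.ts` may be shaped differently: the `poll` and `isAuthenticated` callbacks, the `intervalMs` default, and the `isPolling()` helper are assumptions introduced for illustration.

```typescript
// Module-level handle, mirroring the module-level pollTimer described in F5.
let pollTimer: ReturnType<typeof setInterval> | null = null;

// Starts polling unless a timer is already running. Each tick re-checks
// auth and stops itself once the session is no longer valid.
function startPolling(
  poll: () => void,
  isAuthenticated: () => boolean,
  intervalMs: number = 5000
): void {
  if (pollTimer !== null) return; // already running: do not stack timers
  pollTimer = setInterval(() => {
    if (!isAuthenticated()) {
      stopPolling();
      return;
    }
    poll();
  }, intervalMs);
}

// Safe to call repeatedly, e.g. from both a logout handler and a watcher.
function stopPolling(): void {
  if (pollTimer === null) return;
  clearInterval(pollTimer);
  pollTimer = null;
}

// Illustrative helper for tests and debugging.
function isPolling(): boolean {
  return pollTimer !== null;
}
```

Making both functions idempotent means the logout handler in `App.vue` and the auth `watch` can each call `stopPolling()` without coordinating with one another.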
Phase 5: P1 Scripts — Deploy Safety and Error Handling
-
S4 — Add error handling to first-boot-containers.sh: The script intentionally avoids
`set -e` for idempotency. Instead, add per-section checks: After the Bitcoin Knots container starts, call `wait_for_container bitcoin-knots 120` and check the return value. If Bitcoin fails, skip `create_electrumx`, `create_lnd`, `create_mempool`, `create_btcpay` by checking a `BITCOIN_READY=true/false` flag. Independent apps (Nextcloud, Jellyfin, etc.) are always attempted regardless. Add a summary at the end: "Started X/Y containers successfully. Failed: [list]". Test by examining the script logic; no deploy is needed for this change. -
S5 — Replace eval with safe variable parsing: In
`scripts/deploy-to-target.sh` around line 940, find `eval "$DB_PASSWORDS"`. Replace with explicit parsing: read the SSH output line by line, extract key=value pairs with `IFS='=' read -r key value`, and assign to named variables. This eliminates code injection risk from malformed server output. -
S6 — Add deploy locking: In
`scripts/deploy-to-target.sh`, near the top (after arg parsing), add: `LOCK_FILE="/tmp/archipelago-deploy-${TARGET_HOST}.lock"` then `exec 200>"$LOCK_FILE"; flock -n 200 || { echo "ERROR: Deploy already in progress for $TARGET_HOST"; exit 1; }`. Add stale lock detection: if the lock file mtime is >30 minutes old, break it with `rm -f "$LOCK_FILE"` before attempting flock. -
S7 — Add deploy rollback: In
`scripts/deploy-to-target.sh`, before overwriting the backend binary, add `ssh $SSH_OPTS $TARGET_HOST "cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak 2>/dev/null || true"`. Before overwriting the frontend, add `ssh $SSH_OPTS $TARGET_HOST "cp -r /opt/archipelago/web-ui /opt/archipelago/web-ui.bak 2>/dev/null || true"`. After the health check (`curl /health`), if it fails 3 times, run rollback: `ssh $SSH_OPTS $TARGET_HOST "sudo cp /usr/local/bin/archipelago.bak /usr/local/bin/archipelago; sudo systemctl restart archipelago"`. -
S8 — Remove sshpass from trust-archipelago-cert.sh: Rewrite
`scripts/trust-archipelago-cert.sh` to use SSH key auth: replace the sshpass block with `ssh -i ~/.ssh/archipelago-deploy archipelago@${HOST} ...`. Remove the `sshpass` dependency check. Keep the password only as a last-resort fallback with a warning message. -
S9 — Fix MariaDB password on command line: In
`scripts/first-boot-containers.sh` around line 285, `$DOCKER exec archy-mempool-db mariadb -uroot -p$MYSQL_ROOT_PASS` exposes the password in `ps` output. Replace with: `echo "SELECT 1;" | $DOCKER exec -i archy-mempool-db mariadb -uroot --password="$MYSQL_ROOT_PASS"` or better, because `--password=` still places the secret in the process arguments, use a my.cnf file inside the container. -
S17 — Add disk space pre-flight to deploy: In
`scripts/deploy-to-target.sh`, after SSH key verification, add: `DISK_PCT=$(ssh $SSH_OPTS $TARGET_HOST "df / | tail -1 | awk '{print \$(NF-1)}' | tr -d '%'")`. If `DISK_PCT > 85`, abort with `"ERROR: Target disk at ${DISK_PCT}% — need <85% for safe deploy. Free space and retry."`. -
Phase 5 verification gate: Run
`grep -n 'eval ' scripts/deploy-to-target.sh`, which should not find the DB_PASSWORDS eval. Run `grep -n 'sshpass' scripts/trust-archipelago-cert.sh`, which should return zero results (or only a fallback warning). Test deploy locking: run two deploys to .198 simultaneously; the second should fail with a clear message.
Phase 6: P1 Infrastructure + Remaining P1 Backend
-
I2 — Add systemd resource limits: In
`image-recipe/configs/archipelago.service`, add under `[Service]`: `MemoryMax=4G`, `LimitNOFILE=65535`, `TasksMax=2048`. These prevent the backend from OOM-killing the system or exhausting file descriptors. Keep existing directives (ProtectSystem, NoNewPrivileges, etc). Deploy the config to .198 with `scp image-recipe/configs/archipelago.service archipelago@192.168.1.198:/tmp/ && ssh archipelago@192.168.1.198 "sudo cp /tmp/archipelago.service /etc/systemd/system/ && sudo systemctl daemon-reload && sudo systemctl restart archipelago"`. Verify with `ssh archipelago@192.168.1.198 "systemctl show archipelago | grep -E 'MemoryMax|LimitNOFILE|TasksMax'"`. -
I3 — Tor rotation transition period: In
`core/archipelago/src/api/rpc/tor.rs` around lines 184-240, the `handle_tor_rotate_service()` function deletes the old hidden service directory immediately. Fix: (1) Create the new hidden service in a separate directory first. (2) Wait for the new hostname to appear. (3) Notify federation peers of the new address. (4) Keep the old service running. (5) Schedule deletion of the old service after 24 hours using `tokio::time::sleep(Duration::from_secs(86400))` in a spawned task. This ensures peers have time to learn the new address before the old one goes dark. Run `cargo test` after. -
R14 — Fix .parse().unwrap() in session rate limiting: In
`core/archipelago/src/session.rs` at lines 665, 676, and 688, replace `.parse().unwrap()` with `.parse().unwrap_or(IpAddr::V4(Ipv4Addr::LOCALHOST))` or `.parse().context("Invalid IP in rate limiter")?` depending on the function signature. If the function returns Result, use `?`. If not, use `unwrap_or` with a localhost fallback. Run `cargo test` after. -
R15 — Fix 7 unwrap/expect in mesh/protocol.rs: In
`core/archipelago/src/mesh/protocol.rs`, replace all 7 unwrap/expect calls (lines 582, 592, 614, 649, 679, 713, 728) with proper error propagation using `?` or `.ok_or_else(|| anyhow::anyhow!("descriptive error"))?`. These are in protocol parsing: malformed mesh frames should return errors, not panic. Run `cargo test` after. -
R27 — Add timeouts to mesh Bitcoin RPC calls: In
`core/archipelago/src/mesh/mod.rs` at lines 624, 649, and 663, wrap each Bitcoin RPC HTTP call in `tokio::time::timeout(Duration::from_secs(10), ...)`. Handle a timeout by returning an error to the mesh peer (Bitcoin node unavailable). Run `cargo test` after. -
Phase 6 verification gate: Deploy to .198. Run
`cargo clippy --all-targets --all-features` (zero warnings), `cargo test --all-features` (all pass). Verify systemd limits: `ssh archipelago@192.168.1.198 "systemctl show archipelago | grep MemoryMax"` should show `4294967296`. Run `grep -rn '\.unwrap()' core/archipelago/src/session.rs core/archipelago/src/mesh/protocol.rs | grep -v test | grep -v target`, which should return zero results in those files.
Phase 7: P2 Backend — Unwraps, Dead Code, Hardcoded Values
-
R13+R16 — Fix startup and identity .expect() calls: In
`core/archipelago/src/main.rs` lines 124 and 159, replace `.expect("...")` with `.context("...")?` (the function must return Result). In `core/archipelago/src/identity.rs` lines 114 and 119, replace `.expect("pubkey_hex is valid")` with `.map_err(|e| anyhow::anyhow!("Invalid pubkey hex: {}", e))?`. Run `cargo test`. -
R17+R18+R19 — Fix helpers and js-engine unwraps: In
`core/helpers/src/lib.rs`, fix 5 `.unwrap()` calls at lines 167, 172, 180, 233, 253 by replacing them with `?` or `.context()`. In `core/helpers/src/rsync.rs`, fix 5 `.unwrap()` calls at lines 196, 199, 202, 210, 220. In `core/js-engine/src/lib.rs`, fix `.unwrap()` at lines 130 and 249. Run `cargo test` after all changes. -
R20+R21 — Eliminate all dead code suppressions: In
`core/archipelago/src/mesh/mod.rs`, remove all 14 `#[allow(dead_code)]` annotations (lines 7-25). If the fields/functions are actually used, the code compiles without the annotation. If truly dead, delete them. Check `api/rpc/lnd.rs` line 37, `container/data_manager.rs` line 69, `container/dev_orchestrator.rs` lines 252/258 for the same pattern. Run `cargo clippy`; zero warnings are required. -
R22-R26 — Centralize hardcoded values: Create
`core/archipelago/src/constants.rs` with: `pub const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";`, `pub const DWN_HEALTH_URL: &str = "http://127.0.0.1:3100/health";`, `pub const TOR_SOCKS_PROXY: &str = "socks5h://127.0.0.1:9050";`, `pub const UPDATE_MANIFEST_URL: &str = "https://raw.githubusercontent.com/...";`, `pub const DNS_PROVIDERS: &[&str] = &["https://cloudflare-dns.com/dns-query", "https://dns.google/dns-query", "https://dns.quad9.net/dns-query", "https://dns.mullvad.net/dns-query"];`, and DWN protocol URIs. Add `pub mod constants;` to `lib.rs` or `main.rs`. Then update all files that hardcode these values to import from constants. Run `cargo test`. -
R28+R29 — Add timeouts to LND and DWN calls: In
`core/archipelago/src/api/rpc/lnd.rs`, ensure the reqwest Client used for LND proxy calls has `.timeout(Duration::from_secs(15))` set on construction (not per-request). Check if there's a shared client or if one is created per call. In `core/archipelago/src/network/dwn_sync.rs` line 76, add `.timeout(Duration::from_secs(5))` to the DWN health check request. Run `cargo test`. -
R30-R33 — Resolve all TODO comments: (1)
`api/rpc/handshake.rs:77`, "TODO: track last-seen timestamp": Either implement it (add a timestamp field to the peer struct) or remove the comment. (2) `api/rpc/marketplace.rs:183`, "TODO: Add lnd.lookupinvoice": Either implement or remove the dead code path. (3) `container/health_monitor.rs:140`, "TODO: Trigger auto-restart or alert": Either implement or remove. (4) `security/container_policies.rs:68`, "TODO: Configure Podman to use the profile": Either implement or remove. Per project rules: no TODO in committed code. Run `cargo clippy`. -
Phase 7 verification gate: Run
`cargo clippy --all-targets --all-features` (zero warnings). Run `cargo test --all-features` (all pass). Run `grep -rn 'unwrap\|expect' core/ --include='*.rs' | grep -v test | grep -v target | grep -v 'unwrap_or\|unwrap_err'` and review the remaining instances. Run `grep -rn 'TODO\|FIXME\|HACK' core/ --include='*.rs' | grep -v target` (zero results). Run `grep -rn '127.0.0.1:8332\|127.0.0.1:3100' core/archipelago/src/ --include='*.rs' | grep -v constants.rs | grep -v target` (zero results, all using constants).
Phase 8: P2 Frontend — Resilience and Quality
-
F8 — Fix WebSocket reconnection race: In
`neode-ui/src/api/websocket.ts` lines 212-238, add a `private isReconnecting = false` flag. In `doReconnect()`, check `if (this.isReconnecting) return;` at the start, set `this.isReconnecting = true`, and in the `.then()`/`.catch()` of `this.connect()`, set it back to `false`. This prevents two `onclose` events from triggering parallel reconnections. Run `npm run type-check`. -
F9 — Handle WebSocket parse errors: In
`neode-ui/src/api/websocket.ts` lines 164-172, the catch block silently swallows JSON parse errors. Add a counter: `private parseErrorCount = 0`. In the success path, reset to 0. In the catch, increment. If `parseErrorCount > 3`, call `this.ws?.close()` to trigger reconnection (which will get fresh state per the F4 fix). Run `npm run type-check`. -
F11 — Reduce RPC client timeout and improve backoff: In
`neode-ui/src/api/rpc-client.ts`, find the timeout value (likely 30000ms) and reduce it to 15000ms. Find the retry backoff delay (likely `600 * (attempt + 1)`) and add jitter: `Math.floor(600 * (attempt + 1) * (0.5 + Math.random() * 0.5))`. This prevents a thundering herd on server recovery and reduces max wait from 40s to ~20s. Run `npm run type-check`. -
F12 — Add code splitting via lazy routes: In
`neode-ui/src/router/index.ts`, find all route component imports like `import Web5 from '@/views/Web5.vue'` and change them to `const Web5 = () => import('@/views/Web5.vue')`. Do this for ALL view imports (Web5, Mesh, Dashboard, Settings, Marketplace, Server, Home, AppDetails, Login, Onboarding*, etc.). Keep only the root App.vue as a static import. Then in `neode-ui/vite.config.ts`, add under `build`: `rollupOptions: { output: { manualChunks: { vendor: ['vue', 'vue-router', 'pinia'], api: ['./src/api/rpc-client.ts', './src/api/websocket.ts'] } } }`. Run `npm run build` and check that the output has multiple chunk files, not one monolithic bundle. -
F13 — Add DOMPurify to QR code v-html: In
`neode-ui/src/views/Settings.vue` around line 441, find the `v-html` usage for QR codes. Install DOMPurify if not already present: `npm install dompurify @types/dompurify`. Import it: `import DOMPurify from 'dompurify'`. Before assigning to the ref: `sanitizedQrSvg.value = DOMPurify.sanitize(qrCodeSvg, { USE_PROFILES: { svg: true } })`. Verify the package exists first with `npm view dompurify version`. Run `npm run type-check`. -
F14+F15 — Goals performance + localStorage safety: In
`neode-ui/src/stores/goals.ts`, replace the O(n) `matchesAppId` array lookup with a `Map<string, Set<string>>` for instant lookups. For localStorage saves (lines 34-36 and other stores), wrap all `localStorage.setItem()` calls in try/catch: `try { localStorage.setItem(...) } catch (e) { console.warn('localStorage full:', e) }`. -
Phase 8 verification gate: Run
`cd neode-ui && npm run type-check` (zero errors). Run `cd neode-ui && npm test` (all pass). Run `cd neode-ui && npm run build` and verify multiple chunks in the output (`ls -la ../web/dist/neode-ui/assets/*.js | wc -l` should be > 3). Deploy to .198 and navigate all views.
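The jittered backoff from F11 is easiest to verify as a pure function. The sketch below uses the formula quoted in F11; the function name `backoffDelayMs` and the injectable `random` parameter are assumptions added so the jitter can be pinned in tests.

```typescript
// Linear backoff with multiplicative jitter in [0.5, 1.0), following the
// formula in F11. `random` is injectable so tests can pin the jitter.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs: number = 600
): number {
  const jitterFactor = 0.5 + random() * 0.5;
  return Math.floor(baseMs * (attempt + 1) * jitterFactor);
}

// Bounds per attempt with the default 600 ms base:
//   attempt 0: 300 ms up to just under 600 ms
//   attempt 1: 600 ms up to just under 1200 ms
//   attempt 2: 900 ms up to just under 1800 ms
```

Because each client draws its own jitter factor, clients that lost the connection at the same moment spread their retries over half of each backoff window instead of retrying in lockstep.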
Phase 9: Script Quality + Remaining P2
-
S10 — Replace silent error masking in deploy script: In
`scripts/deploy-to-target.sh`, find the most critical instances of `2>/dev/null || echo ""` (health checks, service status). Replace with `|| { log_warn "Health check failed for $TARGET_HOST"; echo ""; }`. Keep the `|| echo ""` fallback but add logging before it. Focus on the health check functions first (around lines 234-248). Don't change every instance, just the ones that mask real failures (health, service restart, container status). -
S11 — Add trap cleanup to major scripts: In
`scripts/deploy-to-target.sh`, add near the top (after `set -eo pipefail`): `TMPDIR="/tmp/archipelago-deploy-$$"; mkdir -p "$TMPDIR"; trap 'rm -rf "$TMPDIR"' EXIT`. Use `$TMPDIR` for any temp files instead of hardcoded /tmp paths. Do the same for `scripts/deploy-tailscale.sh` and `image-recipe/build-auto-installer-iso.sh`. -
S12 — Quote unquoted variables: Run
`shellcheck scripts/deploy-to-target.sh scripts/first-boot-containers.sh scripts/deploy-tailscale.sh 2>/dev/null | grep 'SC2086' | head -20` to find the most critical unquoted variables. Fix at least the top 20 instances. Double-quote all `$VARIABLE` references in command arguments where word splitting could cause issues. -
S13 — Extract hardcoded IPs to config: Create
`scripts/deploy-config-defaults.sh` (not gitignored) with: `DEFAULT_PRIMARY="192.168.1.228"`, `DEFAULT_SECONDARY="192.168.1.198"`, `TAILSCALE_ARCH1="100.82.97.63"`, `TAILSCALE_ARCH2="100.122.84.60"`, `TAILSCALE_ARCH3="100.124.105.113"`. Source this file from `deploy-to-target.sh`, `deploy-tailscale.sh`, and any script that hardcodes IPs. Use the variables instead of literal IPs. -
S15 — Add memory limits to deploy UI containers: In
`scripts/deploy-to-target.sh`, find where UI containers are created (lines ~842-880: lnd-ui, electrs-ui, bitcoin-ui). Add `--memory=256m` to each `$DOCKER run` command. These are lightweight nginx containers serving static files, so 256MB is generous. -
F16+F17+F18+F19 — Minor frontend fixes: (1)
`filebrowser-client.ts`: Remove the in-memory token, use cookie-only auth. (2) `rpc-client.ts`: Add a header fallback for the CSRF token: if the cookie is not found, check `meta[name="csrf-token"]`. (3) `aiPermissions.ts`: Add runtime validation when loading from localStorage: validate that each item is a valid category string. (4) `AppSession.vue:507`: Track the setTimeout in a `let` variable and clear it in `onBeforeUnmount`. Run `npm run type-check` after all. -
Phase 9 verification gate: Run
`grep -c 'trap.*EXIT' scripts/deploy-to-target.sh scripts/deploy-tailscale.sh`, where both should return 1. Deploy to .198 and verify all UI containers have memory limits: `ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198 "podman inspect --format '{{.HostConfig.Memory}}' archy-lnd-ui archy-electrs-ui archy-bitcoin-ui 2>/dev/null"`.
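The runtime validation in item (3) of the minor frontend fixes (`aiPermissions.ts`) might be sketched as follows. The category names are placeholders: the real list lives in `aiPermissions.ts` and is not shown in this plan. `parseStoredCategories` is likewise a hypothetical name.

```typescript
// Placeholder category names, assumed for illustration only.
const VALID_CATEGORIES = ["files", "network", "apps"] as const;
type Category = (typeof VALID_CATEGORIES)[number];

// Narrowing check: a value is a Category only if it is one of the known strings.
function isCategory(value: unknown): value is Category {
  return (
    typeof value === "string" &&
    (VALID_CATEGORIES as readonly string[]).indexOf(value) !== -1
  );
}

// Parses the raw localStorage string and keeps only valid categories.
// Malformed JSON, non-array values, and unknown entries all degrade to
// "no such permission" rather than throwing.
function parseStoredCategories(raw: string | null): Category[] {
  if (raw === null) return [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(isCategory);
}
```

The caller would pass in the result of `localStorage.getItem(...)` directly, since `getItem` returns `null` when the key is absent.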
Phase 10: Backend Architecture — Split God Files (Part 1)
-
R35 — Split package.rs into submodules (Part 1: Extract config.rs): Create
`core/archipelago/src/api/rpc/package/` directory. Create `mod.rs` that re-exports everything from the original. Move ALL `get_app_config()`, `get_app_capabilities()`, `needs_archy_net()`, and related constant/lookup functions into `config.rs`. The original `package.rs` imports from the new module. Run `cargo test` (all must pass). Run `cargo clippy` (zero warnings). -
R35 — Split package.rs (Part 2: Extract validation.rs): Move input validation functions (app ID validation, dependency checking, image name validation) from
`package.rs` into `package/validation.rs`. Update imports. Run `cargo test`. -
R35 — Split package.rs (Part 3: Extract lifecycle.rs): Move install, start, stop, restart, uninstall operations into
`package/lifecycle.rs`. Move progress streaming into `package/progress.rs`. The remaining `package.rs` (or `package/mod.rs`) should be a thin dispatcher under 200 lines that delegates to the sub-modules. Run `cargo test`; all existing RPC calls must return identical responses. -
R36 — Split mesh/listener.rs into submodules: Create
`core/archipelago/src/mesh/listener/` directory. Extract: (1) `session.rs`: the `run_mesh_session()` loop. (2) `frames.rs`: the `handle_frame()` dispatcher. (3) `identity.rs`: `handle_identity_received()`, `handle_typed_message()`. (4) `sync.rs`: `sync_queued_messages()`, `store_typed_message()`. (5) `bitcoin.rs`: Bitcoin relay RPC operations. Keep `mod.rs` as the entry point with `spawn_mesh_listener()`. No file should exceed 500 lines. Run `cargo test`. -
R37 — Split rpc/mod.rs into submodules: Extract: (1)
`dispatcher.rs`: the match statement that routes method names to handlers. (2) `middleware.rs`: CSRF validation, session checking, rate limiting logic. (3) `response.rs`: response building, error formatting. Keep `mod.rs` as the thin entry point that wires everything together. No file > 500 lines. Run `cargo test`. -
R38 — Split lnd.rs into submodules: Create
`api/rpc/lnd/` directory. Extract: (1) `wallet.rs`: balance, send, receive, invoices. (2) `channels.rs`: open, close, list channels. (3) `info.rs`: node info, network info, connection strings. (4) `payments.rs`: payment history, routing. No file > 500 lines. Run `cargo test`. -
Phase 10 verification gate: Run
`cargo clippy --all-targets --all-features` (zero warnings). Run `cargo test --all-features` (all pass). Check file sizes: `find core/archipelago/src/ -name '*.rs' -exec wc -l {} + | sort -rn | head -20`; no file should exceed 600 lines (allowing some margin). Deploy to .198 and verify all RPC endpoints work.
Phase 11: Frontend Architecture — Split God Components (Part 1)
-
F25 — Split Web5.vue (Part 1: Router shell + Identity): In
`neode-ui/src/router/index.ts`, add nested routes under the Web5 route: `{ path: 'identity', component: () => import('@/views/web5/Web5Identity.vue') }`, etc. Create `neode-ui/src/views/web5/Web5.vue` as a layout shell (~150 lines) with `<router-view />` for sub-views. Extract the DID management section into `Web5Identity.vue`. Ensure the route transition works smoothly. Run `npm run type-check`. -
F25 — Split Web5.vue (Part 2: Extract remaining sections): Create
`Web5Wallet.vue` (wallet operations), `Web5Nostr.vue` (Nostr relays/profiles), `Web5Credentials.vue` (Verifiable Credentials), `Web5Peers.vue` (P2P federation), `Web5Storage.vue` (DWN storage/explorer), `Web5Goals.vue` (goals/voting), `Web5Marketplace.vue` (decentralized marketplace). Each should be under 500 lines. Move shared state to composables if needed (e.g., `useWeb5Identity()`). Run `npm run type-check` and `npm test`. -
F26 — Split Mesh.vue into submodules: Create views/mesh/Mesh.vue as layout with tabs. Extract: MeshRadio.vue (radio status, device connection), MeshChat.vue (chat interface, messages), MeshNetwork.vue (topology, peers), MeshFederation.vue (federation sync). Add nested routes. No component > 500 lines. Run npm run type-check. -
F27 — Split Dashboard.vue into submodules: Create views/dashboard/Dashboard.vue as sidebar + router-view shell. Extract: DashboardHome.vue (overview cards), DashboardApps.vue (running apps, quick actions), DashboardSystem.vue (CPU/RAM/disk stats). Run npm run type-check. -
F28 — Split Settings.vue into submodules: Create views/settings/Settings.vue as tab navigation shell. Extract: SettingsAccount.vue (password, 2FA, sessions), SettingsSystem.vue (server name, reboot, updates), SettingsNetwork.vue (Tor, Tailscale), SettingsAppearance.vue (theme, screensaver). Run npm run type-check. -
Phase 11 verification gate: Run cd neode-ui && npm run type-check — zero errors. Run npm test — all pass. Check component sizes: find neode-ui/src/views -name '*.vue' -exec wc -l {} + | sort -rn | head -20 — no component should exceed 600 lines. Deploy to .198 and navigate every section of Web5, Mesh, Dashboard, Settings.
Phase 12: Frontend Architecture — Split God Components (Part 2)
-
F29+F30+F31+F32 — Split remaining large views: Split Marketplace.vue (1,293 lines) into marketplace/MarketplaceGrid.vue, MarketplaceFilters.vue, MarketplaceInstall.vue. Split Server.vue (1,132 lines) into server/ServerOverview.vue, ServerContainers.vue, ServerLogs.vue. Split Home.vue (1,059 lines) into home/HomeOverview.vue, HomeApps.vue, HomeStatus.vue. Split AppDetails.vue (1,036 lines) into app/AppOverview.vue, AppLogs.vue, AppConfig.vue. Run npm run type-check after each split. -
F33 — Decompose useAppStore into focused stores: Create: stores/auth.ts (login, logout, session, password, TOTP — ~100 lines), stores/server.ts (server info, stats, reboot/shutdown — ~80 lines), stores/realtime.ts (WebSocket connection, subscriptions, heartbeat — ~80 lines), stores/packages.ts (package install/uninstall, marketplace — ~80 lines). Keep stores/app.ts as a thin re-export: export { useAuthStore } from './auth'; export { useServerStore } from './server'; ... plus a useAppStore() function that returns a composed object for backward compatibility. Run npm run type-check and npm test. -
F20+F21+F22+F23+F24 — Remaining P3 frontend fixes: (1) Dashboard.vue: add aria-current="page" to active RouterLink. (2) Apps.vue: debounce search input (150ms) and memoize lowercase strings. (3) style.css: add @media (max-width: 768px) { .glass-card, .glass-button { backdrop-filter: blur(8px); } } to reduce mobile GPU load. (4) types/api.ts: replace Record<string, unknown> for DID operations with branded types. (5) websocket.ts: track checkInterval and clear in all paths. Run npm run type-check. -
Phase 12 verification gate: Run cd neode-ui && npm run type-check — zero errors. Run npm test — all pass. Run find neode-ui/src/views -name '*.vue' -exec wc -l {} + | sort -rn | head -10 — no component > 600 lines. Run wc -l neode-ui/src/stores/app.ts — should be under 100 lines (thin re-export). Deploy to .198 and navigate all views.
Phase 13: Script Architecture — Shared Library + Splits
-
S21 — Create shared script library: Create scripts/lib/common.sh with functions extracted from duplicated patterns: log_info(), log_warn(), log_error() (colored logging), ssh_cmd() (SSH wrapper with key), wait_for_health() (health poll loop), check_disk_space(), mem_limit() (memory limit calculator). Source it from deploy-to-target.sh, first-boot-containers.sh, deploy-tailscale.sh. Run each script with --dry-run or --help to verify sourcing works. -
S18 — Split deploy-to-target.sh (Part 1): Create scripts/deploy/frontend.sh — extract frontend build + sync logic. Create scripts/deploy/backend.sh — extract backend build + sync logic. Keep deploy-to-target.sh as orchestrator that sources lib/common.sh and calls the sub-scripts. Target: orchestrator < 400 lines, each sub-script < 300 lines. Test with ./scripts/deploy-to-target.sh --dry-run --target 192.168.1.198. -
S18 — Split deploy-to-target.sh (Part 2): Extract scripts/deploy/configs.sh (nginx, systemd, script sync), scripts/deploy/containers.sh (container creation/update), scripts/deploy/verify.sh (post-deploy health checks), scripts/deploy/rollback.sh (rollback on failure). No file > 400 lines. -
S19 — Split build-auto-installer-iso.sh: Create image-recipe/build/capture-images.sh, build/create-rootfs.sh, build/install-packages.sh, build/bundle-configs.sh, build/package-iso.sh. Keep orchestrator under 300 lines. -
S20 — Split first-boot-containers.sh: Create scripts/first-boot/databases.sh (MariaDB, PostgreSQL, Redis), first-boot/bitcoin.sh (Bitcoin Knots, ElectrumX), first-boot/lightning.sh (LND, BTCPay), first-boot/apps.sh (Nextcloud, Jellyfin, etc.), first-boot/networking.sh (Tor, Tailscale). Each sources lib/common.sh. No file > 300 lines. -
S16 — Make ISO builds reproducible: Create scripts/image-versions.env with pinned digests: BITCOIN_IMAGE="docker.io/bitcoinknots/bitcoin:v28.1@sha256:...". Source this in build-auto-installer-iso.sh. Never fall back to :latest. Add a manifest file to ISO output recording exact image digests shipped. -
Phase 13 verification gate: wc -l scripts/deploy-to-target.sh must report under 400 lines. wc -l scripts/first-boot-containers.sh must report under 300 lines. wc -l image-recipe/build-auto-installer-iso.sh must report under 300 lines. grep -rn ':latest' scripts/ image-recipe/ | grep -v node_modules | grep -v '#' | grep -v '.md' — zero results. Test deploy: ./scripts/deploy-to-target.sh --dry-run --target 192.168.1.198 — succeeds.
Phase 14: Integration Tests
-
Backend integration tests (Part 1): Create core/archipelago/tests/test_auth_flow.rs — test login → session → CSRF → authenticated request → logout. Create test_rpc_validation.rs — test every public endpoint with invalid input → proper error code. Create test_session_persist.rs — create session → simulate restart → session survives. Create test_rate_limiting.rs — flood endpoint → 429 → wait → allowed. Run cargo test --all-features. -
Backend integration tests (Part 2): Create test_container_lifecycle.rs — install → start → health → stop → uninstall (mock Podman). Create test_backup_restore.rs — create backup → verify integrity → restore to staging → validate. Create test_health_endpoint.rs — healthy → degraded → recovery transitions. Target: 25+ tests passing. -
Frontend integration tests: Create neode-ui/src/__tests__/integration/auth-flow.spec.ts — login → dashboard → timeout → redirect. Create app-lifecycle.spec.ts — marketplace → install → progress → launch → uninstall. Create websocket.spec.ts — connect → update → disconnect → reconnect → state consistent. Create error-handling.spec.ts — network error → toast → retry → success. Create settings-flow.spec.ts — password change → re-login → 2FA setup. Target: 20+ tests passing. Run npm test. -
E2E smoke test script: Create scripts/smoke-test.sh that runs against .198. Tests: (1) curl /health → OK. (2) Login via RPC → get session. (3) server.get-info → valid JSON. (4) container.list → success. (5) Check every /app/* proxy responds. (6) Check WebSocket upgrade (101). (7) Check Tor hidden service if available. Exit 0 only if all pass. Make executable. Run against .198. -
Phase 14 verification gate: cargo test --all-features — 25+ tests pass (count with cargo test --all-features 2>&1 | grep 'test result'). cd neode-ui && npm test -- --reporter=verbose 2>&1 | grep -c 'PASS\|✓' — 20+ tests. ./scripts/smoke-test.sh 192.168.1.198 — exits 0.
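The "flood endpoint → 429 → wait → allowed" flow planned for test_rate_limiting.rs can be exercised without real sleeps if the limiter takes the current time as an argument. The sketch below assumes a simple fixed-window limiter. The RateLimiter type, its limits, and flood_then_recover() are hypothetical stand-ins, since the plan does not describe how the real limiter in the RPC layer is built.

```rust
use std::time::{Duration, Instant};

/// Hypothetical fixed-window limiter; the production limiter may differ.
pub struct RateLimiter {
    max_requests: u32,
    window: Duration,
    window_start: Instant,
    count: u32,
}

impl RateLimiter {
    pub fn new(max_requests: u32, window: Duration, now: Instant) -> Self {
        Self { max_requests, window, window_start: now, count: 0 }
    }

    /// Returns the HTTP status the endpoint would answer with at time `now`.
    pub fn check(&mut self, now: Instant) -> u16 {
        // Start a fresh window once the previous one has fully elapsed.
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            self.count = 0;
        }
        if self.count < self.max_requests {
            self.count += 1;
            200
        } else {
            429
        }
    }
}

/// Drives flood -> 429 -> wait -> allowed with an injected clock.
/// Returns (last allowed status, status after flooding, status after waiting).
pub fn flood_then_recover() -> (u16, u16, u16) {
    let t0 = Instant::now();
    let mut limiter = RateLimiter::new(5, Duration::from_secs(60), t0);

    // Use up the whole budget inside one window.
    let mut last_allowed = 0;
    for _ in 0..5 {
        last_allowed = limiter.check(t0);
    }

    // The sixth request in the same window is rejected.
    let rejected = limiter.check(t0);

    // "Wait" by moving the injected clock past the window instead of sleeping.
    let after_wait = limiter.check(t0 + Duration::from_secs(61));

    (last_allowed, rejected, after_wait)
}
```

Passing the clock in keeps the test deterministic and fast. An integration test against the real server would need either a short configured window or a similar time hook.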
Phase 15: Type Sync + CI/CD Documentation
-
Rust→TypeScript type generation: Add ts-rs = "10" to core/models/Cargo.toml (verify it exists first with cargo search ts-rs). Add #[derive(TS)] and #[ts(export)] to all API request/response structs in core/models/src/. Create a build script or test that generates TypeScript to neode-ui/src/types/generated.ts. Replace manual types in neode-ui/src/types/api.ts with imports from generated.ts where applicable. Run both cargo test and npm run type-check to verify. -
Document CI/CD pipeline plan: Create docs/ci-cd-plan.md documenting the planned GitHub Actions CI/CD setup. Include: (1) CI workflow (triggers: push to main + PRs; jobs: cargo clippy, cargo fmt --check, cargo test, npm type-check, npm lint, npm test; merge policy: all checks must pass). (2) Release workflow (triggers: tag push v*; jobs: Linux binary cross-compile, frontend build, ISO build via SSH, QEMU verification). (3) Pre-requisites list (GitHub Actions runners, Rust toolchain, SSH key for build server, branch protection rules, image digest manifest). (4) Estimated implementation time: 2 weeks. This is documentation only — do not implement CI/CD yet. -
Final verification sweep: Run ALL verification gates from every phase. Deploy to .198. Run smoke test. Verify: no Rust file > 500 lines, no Vue component > 500 lines, no script > 400 lines, no store > 1 responsibility, zero unwraps in production, zero :latest tags, zero sudo podman, zero blocking I/O in async, zero TODO comments. Document results.
-
Update architecture docs: Update docs/architecture-review.html tech debt map and quality scores to reflect all completed work. Update docs/architecture.md codebase stats. Update docs/BETA-PROGRESS.md with completion status. Commit with docs: update architecture review with completed refactoring.