lfg2025/archy


Dorian 01942cea95 docs: mark all overnight plan tasks complete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-21 03:08:52 +00:00


Overnight Plan — Production Excellence

Systematically fix every production-readiness issue across the Archipelago codebase. Each task is self-contained and behavior-preserving. Full issue registry and architectural context in .claude/plans/plan.md. CRITICAL: Deploy ONLY to .198 (ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198). Never deploy to .228. Follow all rules in CLAUDE.md. Atomic commits with fix: or refactor: prefix.

Phase 1: P0 Backend — Hangs, Data Loss, Missing Handlers

R1 — Add health RPC endpoint handler: In core/archipelago/src/api/rpc/mod.rs, the "health" method is listed in UNAUTHENTICATED_METHODS (line 113) but has NO match arm in the dispatcher — it returns "Unknown method" error. Add a handler that returns JSON with: {"status": "ok", "crash_recovery_complete": bool, "uptime_seconds": u64, "version": "..."}. Check if crash recovery is done via the server state. Return "degraded" status if recovery is still in progress. Test by running curl -s -X POST http://192.168.1.198/rpc/v1 -H 'Content-Type: application/json' -d '{"method":"health"}' and verifying real status JSON (not an error). Run cargo clippy --all-targets --all-features and cargo test --all-features on the dev server after changes.
R2 — Add timeout to Nostr client.connect(): In core/archipelago/src/nostr_handshake.rs, there are 4 calls to client.connect().await with NO timeout at lines 124, 161, 262, and 282. If a Nostr relay is down, these hang forever. Wrap each one in tokio::time::timeout(Duration::from_secs(10), client.connect()).await. Handle the timeout error by logging a warning and continuing (Nostr is best-effort). Note that fetch_events() already has timeouts (lines 168, 370) — match that pattern. Run cargo clippy and cargo test after.
R3 — Make backup restore atomic with rollback: In core/archipelago/src/backup/full.rs lines 122-149, restore_full_backup() extracts tar directly to the live data directory. If extraction fails halfway, the system is left in corrupt partial state. Fix by: (1) Check disk space before starting — at least 2x backup size free. (2) Extract to a staging directory (data_dir.join(".restore-staging")). (3) Validate the staging dir has required files (identity/, sessions.json at minimum). (4) Rename current data_dir contents to .restore-backup. (5) Move staging contents to data_dir. (6) On any failure, restore from .restore-backup. (7) Clean up staging/backup dirs on success. Use tokio::fs for all operations. Run cargo test after.
I1 — Protect unauthenticated nginx endpoints: In image-recipe/configs/nginx-archipelago.conf, the /archipelago/ location block (lines 116-121), /content (line 166+), and /dwn (line 176+) have NO timeout, rate-limit, or body size protection. Add to each of these three location blocks: limit_req zone=rpc burst=20 nodelay;, client_max_body_size 10m;, proxy_connect_timeout 30s;, proxy_read_timeout 60s;, proxy_send_timeout 30s;. If a limit_req_zone named rpc doesn't exist, check the existing zones at the top of the config and either use an existing one or add limit_req_zone $binary_remote_addr zone=peer:10m rate=10r/s; in the http block and reference zone=peer. Verify syntax with nginx -t on .198 after deploying.
Phase 1 verification gate: SSH to .198 and run: cargo clippy --all-targets --all-features (zero warnings), cargo test --all-features (all pass). Deploy with ./scripts/deploy-to-target.sh --target 192.168.1.198. Then run curl http://192.168.1.198/health and verify it returns real JSON status. Run curl -X POST http://192.168.1.198/rpc/v1 -H 'Content-Type: application/json' -d '{"method":"health"}' and verify JSON response with status field.

Phase 2: P0 Frontend — Race Conditions and Silent Failures

F1 — Fix WebSocket subscription race condition: In neode-ui/src/stores/app.ts lines 88-134, connectWebSocket() uses a local isWsSubscribed flag to prevent double-subscription, but two rapid calls can both pass the check before either sets it. Fix: (1) Move isWsSubscribed to a module-level let variable initialized to false (if not already). (2) Add an early return if WebSocket is already connecting: if (wsClient.isConnecting()) return;. (3) Before subscribing, call wsClient.unsubscribeAll() to clear any prior callbacks, THEN subscribe fresh. This ensures exactly one callback is active regardless of how many times connectWebSocket() is called. Run cd neode-ui && npm run type-check after.
F2 — Protect mesh store concurrent mutations: In neode-ui/src/stores/mesh.ts, sendMessage() (line 249), sendInvoice(), and sendCoordinate() all call fetchMessages() after sending, so multiple concurrent calls can race. Fix: Add a const sendQueue = ref<Promise<void>>(Promise.resolve()) at module level. Each send function chains onto it: sendQueue.value = sendQueue.value.then(() => doSend(...)). Catch or log rejections on the stored chain (for example by appending .catch(() => {})), because otherwise one failed send leaves the chain rejected and every later send queued behind it is skipped. This serializes sends so fetchMessages() is never called concurrently. Run npm run type-check after.
F3 — Add global Vue error handler: In neode-ui/src/main.ts, there is no app.config.errorHandler. Any component error causes a white screen. After const app = createApp(App), add: app.config.errorHandler = (err, instance, info) => { console.error('[Vue Error]', err, info); const { showError } = useToast(); showError('Something went wrong. Please refresh the page.'); };. Import useToast from @/composables/useToast. Run npm run type-check after.
S1 — Eliminate all sudo podman in scripts: Run grep -rn 'sudo podman' scripts/ indeedhub/ image-recipe/ to find all instances. In each script, replace sudo podman with podman. The main offenders are: scripts/fix-indeedhub-containers.sh (28 instances), scripts/deploy-bitcoin-knots.sh (11 instances), scripts/deploy-tailscale.sh (check for any remaining), scripts/uptime-monitor.sh, scripts/setup-aiui-server.sh. After replacing, verify with grep -rn 'sudo podman' scripts/ indeedhub/ image-recipe/ — should return zero results. Do NOT change docs/ files (those are historical records).
S2 — Add health checks to all containers in first-boot: In scripts/first-boot-containers.sh, every $DOCKER run command needs --health-cmd, --health-interval=30s, --health-timeout=5s, --health-retries=3. Use appropriate health commands: For Bitcoin Knots (line ~253): --health-cmd="bitcoin-cli -rpcuser=\$BITCOIN_RPC_USER -rpcpassword=\$BITCOIN_RPC_PASS getblockchaininfo || exit 1". For HTTP apps (Mempool, BTCPay, Grafana, etc.): --health-cmd="curl -sf http://localhost:{PORT}/ || exit 1". For LND: --health-cmd="curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1". For databases (MariaDB): --health-cmd="mariadb -uroot -e 'SELECT 1' || exit 1". For ElectrumX: --health-cmd="curl -sf http://localhost:50002/ || exit 1". After editing, verify with grep -c 'health-cmd' scripts/first-boot-containers.sh — should match the number of $DOCKER run commands.
Phase 2 verification gate: Run cd neode-ui && npm run type-check (zero errors). Run cd neode-ui && npm test (all pass). Run grep -rn 'sudo podman' scripts/ indeedhub/ image-recipe/ | grep -v docs/ | grep -v '#' and verify zero results. Run grep -c 'health-cmd' scripts/first-boot-containers.sh and verify count matches number of container run commands.
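For task F2, one way to serialize the sends is a small promise queue. The sketch below is a minimal illustration, not the store's actual code: createSendQueue and enqueue are hypothetical names, and a plain variable stands in for the Vue ref so the example runs without dependencies. It also shows why the stored tail should swallow rejections: the caller still sees the failure, but later sends are not blocked by it.

```typescript
// Hypothetical helper: runs async tasks strictly one after another.
function createSendQueue() {
  // The tail always resolves, so a failed task never poisons the chain.
  let tail: Promise<void> = Promise.resolve();

  function enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after the previous one has settled.
    const result = tail.then(task);
    // Store a version of the promise that cannot reject.
    tail = result.then(
      () => undefined,
      () => undefined,
    );
    // The caller still receives the real outcome, including rejections.
    return result;
  }

  return { enqueue };
}
```

Under these assumptions, each send function would wrap its body as queue.enqueue(async () => { await doSend(...); await fetchMessages(); }), so the fetch that follows a send always finishes before the next send begins.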

Phase 3: P1 Backend — Blocking I/O in Async Context

R6 — Fix session.rs blocking I/O (6 calls): In core/archipelago/src/session.rs, replace: (1) Line 77: std::fs::read_to_string() → tokio::fs::read_to_string().await. (2) Line 128: std::fs::write() → tokio::fs::write().await. (3) Line 370: std::fs::read() → tokio::fs::read().await. (4) Line 413: std::fs::read() → tokio::fs::read().await. (5) Line 423: std::fs::create_dir_all() → tokio::fs::create_dir_all().await. (6) Line 425: std::fs::write() → tokio::fs::write().await. Make sure the containing functions are async fn. Add use tokio::fs; at top if not present. Run cargo clippy and cargo test after.
R7 — Fix docker_packages.rs blocking I/O: In core/archipelago/src/container/docker_packages.rs, replace std::fs::read_to_string() at lines 561 and 573 with tokio::fs::read_to_string().await. Ensure the containing function is async. Run cargo test after.
R8 — Fix port_allocator.rs blocking I/O: In core/archipelago/src/port_allocator.rs, replace: (1) Line 59: std::fs::read_to_string() → tokio::fs::read_to_string().await. (2) Line 73: std::fs::create_dir_all() → tokio::fs::create_dir_all().await. (3) Line 77: std::fs::write() → tokio::fs::write().await. Run cargo test after.
R9+R10+R11 — Fix remaining blocking I/O across 5 files: (1) core/archipelago/src/peers.rs line 30: fs::read_to_string() → tokio::fs::read_to_string().await. (2) core/archipelago/src/node_message.rs line 65: std::fs::write() → tokio::fs::write().await. (3) core/archipelago/src/identity.rs line 50: fs::set_permissions() → tokio::fs::set_permissions().await. (4) core/archipelago/src/identity_manager.rs line 164: fs::set_permissions() → tokio::fs::set_permissions().await. (5) core/archipelago/src/nostr_discovery.rs line 55: std::fs::set_permissions() → tokio::fs::set_permissions().await. Run cargo clippy and cargo test after all changes.
R12 — Fix electrs_status.rs sync TCP I/O: In core/archipelago/src/electrs_status.rs, the entire module uses synchronous TCP I/O (std::net::TcpStream, BufReader, write_all). Convert to async using tokio::net::TcpStream and tokio::io::{AsyncBufReadExt, AsyncWriteExt}. Replace std::fs::read_dir() at line 40 with tokio::fs::read_dir().await. Wrap the TCP connection in a tokio::time::timeout(Duration::from_secs(5), ...) to prevent hangs if ElectrumX is down. Run cargo clippy and cargo test after.
R4+R5 — Spawn rate limiter cleanup tasks: In core/archipelago/src/session.rs, the EndpointRateLimiter::cleanup() method (lines 566-579) and LoginRateLimiter cleanup exist but are never called. In the RpcHandler::new() function (or wherever the rate limiters are constructed), spawn a background task: let limiter = endpoint_rate_limiter.clone(); tokio::spawn(async move { let mut interval = tokio::time::interval(Duration::from_secs(300)); loop { interval.tick().await; limiter.cleanup().await; } });. Do the same for LoginRateLimiter. Run cargo test after.
Phase 3 verification gate: Run on .198: cargo clippy --all-targets --all-features (zero warnings), cargo test --all-features (all pass). Search for remaining blocking I/O: grep -rn 'std::fs::' core/archipelago/src/ --include='*.rs' | grep -v test | grep -v target — should return minimal results (only in non-async contexts or test code). Deploy to .198 and verify health.

Phase 4: P1 Frontend — Memory Leaks and Stale State

F4 — WebSocket reconnect full state refresh: In neode-ui/src/stores/app.ts, after wsClient.connect() succeeds in connectWebSocket(), immediately call const freshState = await rpcClient.call({ method: 'server.get-state' }) and set data.value = freshState.data to get fresh state. This ensures no stale patches are applied to outdated base state after a disconnect. Run npm run type-check after.
F5 — Fix message polling timer lifecycle: In neode-ui/src/composables/useMessageToast.ts, the pollTimer (setInterval at line 60) is module-level and never cleaned up on logout. Fix: (1) In startPolling(), check that auth is still valid before polling. (2) Ensure stopPolling() is called on logout. (3) In neode-ui/src/App.vue, find where startPolling is called and add stopPolling() to the logout/auth-change handler. (4) Add a watch on the auth state: when it becomes false, call stopPolling(). Run npm run type-check and npm test after.
F6 — Fix AppLauncher NIP-07 listener leak: In neode-ui/src/stores/appLauncher.ts lines 295-301, the handleNostrRequest listener is added on isOpen=true and removed on isOpen=false. But if the user navigates away (route change) without closing the overlay, the listener persists. Fix: In the close() function, explicitly call window.removeEventListener('message', handleNostrRequest). Also add a router beforeEach guard or use onBeforeUnmount in the component that uses this store to call close(). Run npm run type-check after.
F7 — Fix audio player listener stacking: In neode-ui/src/composables/useAudioPlayer.ts, play() creates a new Audio() element and adds 6 event listeners whenever audio.value is null. Because audio is a module-level ref, the element persists across calls, and the listeners attached to it are never removed. Fix: (1) Create the Audio element and listeners once in an init() function. (2) Use a let initialized = false flag to prevent re-initialization. (3) In play(), just set audio.value.src and call audio.value.play(). Run npm run type-check after.
S3 — Pin all container images — remove :latest: Across all scripts, replace every :latest tag with a specific version. Create scripts/image-versions.env as single source of truth: BITCOIN_KNOTS_IMAGE="docker.io/bitcoinknots/bitcoin:v28.1", SEARXNG_IMAGE="docker.io/searxng/searxng:2024.11.17", PHOTOPRISM_IMAGE="docker.io/photoprism/photoprism:240915", etc. Source this file from first-boot-containers.sh, deploy-to-target.sh, deploy-tailscale.sh, and build-auto-installer-iso.sh. For custom/local images (lnd-ui, electrs-ui, bitcoin-ui, indeedhub), use localhost/{name}:$(git rev-parse --short HEAD) or a date-based tag instead of :latest. Verify with grep -rn ':latest' scripts/ image-recipe/ | grep -v node_modules | grep -v '#' | grep -v '.md' — should return zero results.
Phase 4 verification gate: Run cd neode-ui && npm run type-check (zero errors). Run cd neode-ui && npm test (all pass). Run grep -rn ':latest' scripts/ image-recipe/ | grep -v node_modules | grep -v '#' | grep -v '.md' — zero results. Deploy to .198 and verify WebSocket reconnection works (kill backend, wait, restart, check UI recovers with fresh data).
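The timer lifecycle described in task F5 can be sketched as a start/stop pair that is idempotent and re-checks auth on every tick. This is an illustration under stated assumptions: createPoller, isAuthenticated, and isRunning are hypothetical names, while the real composable keeps pollTimer at module level and reads auth from the app's own state.

```typescript
type PollFn = () => void | Promise<void>;

// Hypothetical helper: owns exactly one interval timer at a time.
function createPoller(
  poll: PollFn,
  isAuthenticated: () => boolean,
  intervalMs: number,
) {
  let timer: ReturnType<typeof setInterval> | null = null;

  function stop(): void {
    if (timer !== null) {
      clearInterval(timer);
      timer = null;
    }
  }

  function start(): void {
    // A second start() must not create a second timer.
    if (timer !== null) return;
    // Never begin polling for a logged-out user.
    if (!isAuthenticated()) return;
    timer = setInterval(() => {
      // Stop by ourselves if the session ended between ticks.
      if (!isAuthenticated()) {
        stop();
        return;
      }
      // Log sync throws and async rejections instead of leaving them unhandled.
      Promise.resolve()
        .then(poll)
        .catch((err) => console.warn('poll failed', err));
    }, intervalMs);
  }

  return { start, stop, isRunning: () => timer !== null };
}
```

Calling stop() from the logout handler and from an auth watcher is then safe even if both fire, because stop() does nothing when no timer is active.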

Phase 5: P1 Scripts — Deploy Safety and Error Handling

S4 — Add error handling to first-boot-containers.sh: The script intentionally avoids set -e for idempotency. Instead, add per-section checks: After Bitcoin Knots container start, call wait_for_container bitcoin-knots 120 and check the return value. If Bitcoin fails, skip create_electrumx, create_lnd, create_mempool, create_btcpay by checking a BITCOIN_READY=true/false flag. Independent apps (Nextcloud, Jellyfin, etc.) always attempt regardless. Add a summary at the end: "Started X/Y containers successfully. Failed: [list]". Test by examining the script logic — no deploy needed for this change.
S5 — Replace eval with safe variable parsing: In scripts/deploy-to-target.sh around line 940, find eval "$DB_PASSWORDS". Replace with explicit parsing: read the SSH output line by line, extract key=value pairs with IFS='=' read -r key value, and assign to named variables. This eliminates code injection risk from malformed server output.
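S6 — Add deploy locking: In scripts/deploy-to-target.sh, near the top (after arg parsing), add: LOCK_FILE="/tmp/archipelago-deploy-${TARGET_HOST}.lock" then exec 200>"$LOCK_FILE"; flock -n 200 || { echo "ERROR: Deploy already in progress for $TARGET_HOST"; exit 1; }. Stale lock handling: a flock lock is released by the kernel when the holding process exits, so a leftover lock file never blocks the next deploy. Do not break the lock by deleting the file when its mtime is >30 minutes old: if a slow deploy still holds the lock, rm -f "$LOCK_FILE" lets a second deploy lock a freshly created file and run in parallel.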
S7 — Add deploy rollback: In scripts/deploy-to-target.sh, before overwriting the backend binary, add ssh $SSH_OPTS $TARGET_HOST "cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak 2>/dev/null || true". Before overwriting frontend, add ssh $SSH_OPTS $TARGET_HOST "rm -rf /opt/archipelago/web-ui.bak; cp -r /opt/archipelago/web-ui /opt/archipelago/web-ui.bak 2>/dev/null || true". Removing the previous backup first matters: cp -r into an existing web-ui.bak directory would nest the copy as web-ui.bak/web-ui. After the health check (curl /health), if it fails 3 times, run rollback: ssh $SSH_OPTS $TARGET_HOST "sudo cp /usr/local/bin/archipelago.bak /usr/local/bin/archipelago; sudo systemctl restart archipelago".
S8 — Remove sshpass from trust-archipelago-cert.sh: Rewrite scripts/trust-archipelago-cert.sh to use SSH key auth: replace the sshpass block with ssh -i ~/.ssh/archipelago-deploy archipelago@${HOST} .... Remove the sshpass dependency check. Keep password only as last-resort fallback with a warning message.
S9 — Fix MariaDB password on command line: In scripts/first-boot-containers.sh around line 285, $DOCKER exec archy-mempool-db mariadb -uroot -p$MYSQL_ROOT_PASS exposes the password in ps output. Switching to echo "SELECT 1;" | $DOCKER exec -i archy-mempool-db mariadb -uroot --password="$MYSQL_ROOT_PASS" only changes the quoting: --password=... is still a command-line argument and is still visible in the process list. Prefer a my.cnf file inside the container so the password never appears in argv.
S17 — Add disk space pre-flight to deploy: In scripts/deploy-to-target.sh, after SSH key verification, add: DISK_PCT=$(ssh $SSH_OPTS $TARGET_HOST "df / | tail -1 | awk '{print \$(NF-1)}' | tr -d '%'"). If DISK_PCT > 85, abort with "ERROR: Target disk at ${DISK_PCT}% — need <85% for safe deploy. Free space and retry.".
Phase 5 verification gate: Run grep -n 'eval ' scripts/deploy-to-target.sh — should not find the DB_PASSWORDS eval. Run grep -n 'sshpass' scripts/trust-archipelago-cert.sh — should return zero (or only a fallback warning). Test deploy locking: run two deploys to .198 simultaneously — second should fail with clear message.

Phase 6: P1 Infrastructure + Remaining P1 Backend

I2 — Add systemd resource limits: In image-recipe/configs/archipelago.service, add under [Service]: MemoryMax=4G, LimitNOFILE=65535, TasksMax=2048. These prevent the backend from OOM-killing the system or exhausting file descriptors. Keep existing directives (ProtectSystem, NoNewPrivileges, etc). Deploy config to .198 with scp image-recipe/configs/archipelago.service archipelago@192.168.1.198:/tmp/ && ssh archipelago@192.168.1.198 "sudo cp /tmp/archipelago.service /etc/systemd/system/ && sudo systemctl daemon-reload && sudo systemctl restart archipelago". Verify with ssh archipelago@192.168.1.198 "systemctl show archipelago | grep -E 'MemoryMax|LimitNOFILE|TasksMax'".
I3 — Tor rotation transition period: In core/archipelago/src/api/rpc/tor.rs around lines 184-240, the handle_tor_rotate_service() function deletes the old hidden service directory immediately. Fix: (1) Create the new hidden service in a separate directory first. (2) Wait for the new hostname to appear. (3) Notify federation peers of the new address. (4) Keep the old service running. (5) Schedule deletion of old service after 24 hours using tokio::time::sleep(Duration::from_secs(86400)) in a spawned task. This ensures peers have time to learn the new address before the old one goes dark. Run cargo test after.
R14 — Fix .parse().unwrap() in session rate limiting: In core/archipelago/src/session.rs at lines 665, 676, and 688, replace .parse().unwrap() with .parse().unwrap_or(IpAddr::V4(Ipv4Addr::LOCALHOST)) or .parse().context("Invalid IP in rate limiter")? depending on the function signature. If the function returns Result, use ?. If not, use unwrap_or with localhost fallback. Run cargo test after.
R15 — Fix 7 unwrap/expect in mesh/protocol.rs: In core/archipelago/src/mesh/protocol.rs, replace all 7 unwrap/expect calls (lines 582, 592, 614, 649, 679, 713, 728) with proper error propagation using ? or .ok_or_else(|| anyhow::anyhow!("descriptive error"))?. These are in protocol parsing — malformed mesh frames should return errors, not panic. Run cargo test after.
R27 — Add timeouts to mesh Bitcoin RPC calls: In core/archipelago/src/mesh/mod.rs at lines 624, 649, and 663, wrap each Bitcoin RPC HTTP call in tokio::time::timeout(Duration::from_secs(10), ...). Handle timeout by returning an error to the mesh peer (Bitcoin node unavailable). Run cargo test after.
Phase 6 verification gate: Deploy to .198. Run cargo clippy --all-targets --all-features (zero warnings), cargo test --all-features (all pass). Verify systemd limits: ssh archipelago@192.168.1.198 "systemctl show archipelago | grep MemoryMax" should show 4294967296. Run grep -rn '\.unwrap()' core/archipelago/src/session.rs core/archipelago/src/mesh/protocol.rs | grep -v test | grep -v target — should return zero results in those files.

Phase 7: P2 Backend — Unwraps, Dead Code, Hardcoded Values

R13+R16 — Fix startup and identity .expect() calls: In core/archipelago/src/main.rs lines 124 and 159, replace .expect("...") with .context("...")? (function must return Result). In core/archipelago/src/identity.rs lines 114 and 119, replace .expect("pubkey_hex is valid") with .map_err(|e| anyhow::anyhow!("Invalid pubkey hex: {}", e))?. Run cargo test.
R17+R18+R19 — Fix helpers and js-engine unwraps: In core/helpers/src/lib.rs, fix 5 .unwrap() calls at lines 167, 172, 180, 233, 253 — replace with ? or .context(). In core/helpers/src/rsync.rs, fix 5 .unwrap() calls at lines 196, 199, 202, 210, 220. In core/js-engine/src/lib.rs, fix .unwrap() at lines 130 and 249. Run cargo test after all changes.
R20+R21 — Eliminate all dead code suppressions: In core/archipelago/src/mesh/mod.rs, remove all 14 #[allow(dead_code)] annotations (lines 7-25). If the fields/functions are actually used, the code compiles without the annotation. If truly dead, delete them. Check api/rpc/lnd.rs line 37, container/data_manager.rs line 69, container/dev_orchestrator.rs lines 252/258 for the same pattern. Run cargo clippy — zero warnings required.
R22-R26 — Centralize hardcoded values: Create core/archipelago/src/constants.rs with: pub const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";, pub const DWN_HEALTH_URL: &str = "http://127.0.0.1:3100/health";, pub const TOR_SOCKS_PROXY: &str = "socks5h://127.0.0.1:9050";, pub const UPDATE_MANIFEST_URL: &str = "https://raw.githubusercontent.com/...";, pub const DNS_PROVIDERS: &[&str] = &["https://cloudflare-dns.com/dns-query", "https://dns.google/dns-query", "https://dns.quad9.net/dns-query", "https://dns.mullvad.net/dns-query"];, and DWN protocol URIs. Add pub mod constants; to lib.rs or main.rs. Then update all files that hardcode these values to import from constants. Run cargo test.
R28+R29 — Add timeouts to LND and DWN calls: In core/archipelago/src/api/rpc/lnd.rs, ensure the reqwest Client used for LND proxy calls has .timeout(Duration::from_secs(15)) set on construction (not per-request). Check if there's a shared client or if one is created per call. In core/archipelago/src/network/dwn_sync.rs line 76, add .timeout(Duration::from_secs(5)) to the DWN health check request. Run cargo test.
R30-R33 — Resolve all TODO comments: (1) api/rpc/handshake.rs:77 — "TODO: track last-seen timestamp": Either implement it (add timestamp field to peer struct) or remove the comment. (2) api/rpc/marketplace.rs:183 — "TODO: Add lnd.lookupinvoice": Either implement or remove dead code path. (3) container/health_monitor.rs:140 — "TODO: Trigger auto-restart or alert": Either implement or remove. (4) security/container_policies.rs:68 — "TODO: Configure Podman to use the profile": Either implement or remove. Per project rules: no TODO in committed code. Run cargo clippy.
Phase 7 verification gate: Run cargo clippy --all-targets --all-features — zero warnings. Run cargo test --all-features — all pass. Run grep -rn 'unwrap\|expect' core/ --include='*.rs' | grep -v test | grep -v target | grep -v 'unwrap_or\|unwrap_err' — review remaining instances. Run grep -rn 'TODO\|FIXME\|HACK' core/ --include='*.rs' | grep -v target — zero results. Run grep -rn '127.0.0.1:8332\|127.0.0.1:3100' core/archipelago/src/ --include='*.rs' | grep -v constants.rs | grep -v target — zero results (all using constants).

Phase 8: P2 Frontend — Resilience and Quality

F8 — Fix WebSocket reconnection race: In neode-ui/src/api/websocket.ts lines 212-238, add a private isReconnecting = false flag. In doReconnect(), check if (this.isReconnecting) return; at the start, set this.isReconnecting = true, and in the .then()/.catch() of this.connect(), set it back to false. This prevents two onclose events from triggering parallel reconnections. Run npm run type-check.
F9 — Handle WebSocket parse errors: In neode-ui/src/api/websocket.ts lines 164-172, the catch block silently swallows JSON parse errors. Add a counter: private parseErrorCount = 0. In the success path, reset to 0. In the catch, increment. If parseErrorCount > 3, call this.ws?.close() to trigger reconnection (which will get fresh state per F4 fix). Run npm run type-check.
F11 — Reduce RPC client timeout and improve backoff: In neode-ui/src/api/rpc-client.ts, find the timeout value (likely 30000ms) and reduce to 15000ms. Find the retry backoff delay (likely 600 * (attempt + 1)) and add jitter: Math.floor(600 * (attempt + 1) * (0.5 + Math.random() * 0.5)). This prevents thundering herd on server recovery and reduces max wait from 40s to ~20s. Run npm run type-check.
F12 — Add code splitting via lazy routes: In neode-ui/src/router/index.ts, find all route component imports like import Web5 from '@/views/Web5.vue' and change to const Web5 = () => import('@/views/Web5.vue'). Do this for ALL view imports (Web5, Mesh, Dashboard, Settings, Marketplace, Server, Home, AppDetails, Login, Onboarding*, etc.). Keep only the root App.vue as a static import. Then in neode-ui/vite.config.ts, add under the build key: rollupOptions: { output: { manualChunks: { vendor: ['vue', 'vue-router', 'pinia'], api: ['./src/api/rpc-client.ts', './src/api/websocket.ts'] } } }. Run npm run build and check that output has multiple chunk files, not one monolithic bundle.
F13 — Add DOMPurify to QR code v-html: In neode-ui/src/views/Settings.vue around line 441, find the v-html usage for QR codes. Install DOMPurify if not already: npm install dompurify @types/dompurify. Import it: import DOMPurify from 'dompurify'. Before assigning to the ref: sanitizedQrSvg.value = DOMPurify.sanitize(qrCodeSvg, { USE_PROFILES: { svg: true } }). Verify the package exists first with npm view dompurify version. Run npm run type-check.
F14+F15 — Goals performance + localStorage safety: In neode-ui/src/stores/goals.ts, replace the O(n) matchesAppId array lookup with a Map<string, Set<string>> for instant lookups. For localStorage saves (lines 34-36 and other stores), wrap all localStorage.setItem() calls in try/catch: try { localStorage.setItem(...) } catch (e) { console.warn('localStorage full:', e) }.
Phase 8 verification gate: Run cd neode-ui && npm run type-check (zero errors). Run cd neode-ui && npm test (all pass). Run cd neode-ui && npm run build and verify multiple chunks in output (ls -la ../web/dist/neode-ui/assets/*.js | wc -l should be > 3). Deploy to .198 and navigate all views.
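The jittered backoff from task F11 can be checked in isolation. The sketch below is a minimal version, assuming the 600 ms linear base quoted in the task; backoffDelay and the injectable random parameter are hypothetical names added so the function can be tested deterministically. Because the factor 0.5 + random() * 0.5 lies in [0.5, 1), the jittered delay is always between half of the linear delay and just under the full linear delay.

```typescript
const BASE_DELAY_MS = 600;

// Hypothetical helper: linear backoff scaled by a random factor in [0.5, 1).
function backoffDelay(
  attempt: number,
  random: () => number = Math.random,
): number {
  // Linear part: 600 ms, 1200 ms, 1800 ms, ... for attempts 0, 1, 2, ...
  const linear = BASE_DELAY_MS * (attempt + 1);
  // Jitter spreads clients out so they do not all retry at the same instant.
  return Math.floor(linear * (0.5 + random() * 0.5));
}
```

For attempt 0 this yields a delay from 300 ms up to (but not including) 600 ms, so the jitter only ever shortens the wait relative to the unjittered schedule.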

Phase 9: Script Quality + Remaining P2

S10 — Replace silent error masking in deploy script: In scripts/deploy-to-target.sh, find the most critical instances of 2>/dev/null || echo "" (health checks, service status). Replace with || { log_warn "Health check failed for $TARGET_HOST"; echo ""; }. Keep the || echo "" fallback but add logging before it. Focus on the health check functions first (around lines 234-248). Don't change every instance — just the ones that mask real failures (health, service restart, container status).
S11 — Add trap cleanup to major scripts: In scripts/deploy-to-target.sh, add near the top (after set -eo pipefail): TMPDIR="/tmp/archipelago-deploy-$$"; mkdir -p "$TMPDIR"; trap 'rm -rf "$TMPDIR"' EXIT. Use $TMPDIR for any temp files instead of hardcoded /tmp paths. Do the same for scripts/deploy-tailscale.sh and image-recipe/build-auto-installer-iso.sh.
S12 — Quote unquoted variables: Run shellcheck scripts/deploy-to-target.sh scripts/first-boot-containers.sh scripts/deploy-tailscale.sh 2>/dev/null | grep 'SC2086' | head -20 to find the most critical unquoted variables. Fix at least the top 20 instances. Double-quote all $VARIABLE references in command arguments where word splitting could cause issues.
S13 — Extract hardcoded IPs to config: Create scripts/deploy-config-defaults.sh (not gitignored) with: DEFAULT_PRIMARY="192.168.1.228", DEFAULT_SECONDARY="192.168.1.198", TAILSCALE_ARCH1="100.82.97.63", TAILSCALE_ARCH2="100.122.84.60", TAILSCALE_ARCH3="100.124.105.113". Source this file from deploy-to-target.sh, deploy-tailscale.sh, and any script that hardcodes IPs. Use the variables instead of literal IPs.
S15 — Add memory limits to deploy UI containers: In scripts/deploy-to-target.sh, find where UI containers are created (lines ~842-880: lnd-ui, electrs-ui, bitcoin-ui). Add --memory=256m to each $DOCKER run command. These are lightweight nginx containers serving static files — 256MB is generous.
F16+F17+F18+F19 — Minor frontend fixes: (1) filebrowser-client.ts: Remove in-memory token, use cookie-only auth. (2) rpc-client.ts: Add header fallback for CSRF token — if cookie not found, check meta[name="csrf-token"]. (3) aiPermissions.ts: Add runtime validation when loading from localStorage — validate each item is a valid category string. (4) AppSession.vue:507: Track the setTimeout in a let variable and clear it in onBeforeUnmount. Run npm run type-check after all.
Phase 9 verification gate: Run grep -c 'trap.*EXIT' scripts/deploy-to-target.sh scripts/deploy-tailscale.sh — both should return 1. Deploy to .198 and verify all UI containers have memory limits: ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198 "podman inspect --format '{{.HostConfig.Memory}}' archy-lnd-ui archy-electrs-ui archy-bitcoin-ui 2>/dev/null".
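Item (3) of the minor frontend fixes, runtime validation of values loaded from localStorage, could look roughly like the sketch below. The category names are placeholders, since the plan does not list the real ones, and parseStoredCategories and isAiCategory are hypothetical helpers. The idea is to treat the stored string as untrusted input: anything that is not a JSON array is rejected, and unknown entries are dropped.

```typescript
// Placeholder categories: the real list lives in aiPermissions.ts.
const AI_CATEGORIES = ['files', 'network', 'wallet'] as const;
type AiCategory = (typeof AI_CATEGORIES)[number];

// Type guard: true only for strings that are known categories.
function isAiCategory(value: unknown): value is AiCategory {
  return (
    typeof value === 'string' &&
    (AI_CATEGORIES as readonly string[]).indexOf(value) !== -1
  );
}

// Parses the raw localStorage value and keeps only valid categories.
function parseStoredCategories(raw: string | null): AiCategory[] {
  if (raw === null) return [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Corrupt JSON is treated the same as "nothing stored".
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(isAiCategory);
}
```

A store would then call parseStoredCategories(localStorage.getItem(key)) when loading, with key being whatever storage key the store already uses.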

Phase 10: Backend Architecture — Split God Files (Part 1)

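R35 — Split package.rs into submodules (Part 1: Extract config.rs): Create core/archipelago/src/api/rpc/package/ directory. Rust rejects a module that has both package.rs and package/mod.rs, so pick one layout: either move the original file to package/mod.rs, or keep package.rs as the module root and declare the submodules in it (mod config;). Whichever file is the module root re-exports everything the original exposed. Move ALL get_app_config(), get_app_capabilities(), needs_archy_net(), and related constant/lookup functions into config.rs, and import them in the module root. Run cargo test; all must pass. Run cargo clippy; zero warnings required.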
R35 — Split package.rs (Part 2: Extract validation.rs): Move input validation functions (app ID validation, dependency checking, image name validation) from package.rs into package/validation.rs. Update imports. Run cargo test.
R35 — Split package.rs (Part 3: Extract lifecycle.rs): Move install, start, stop, restart, uninstall operations into package/lifecycle.rs. Move progress streaming into package/progress.rs. The remaining package.rs (or package/mod.rs) should be a thin dispatcher under 200 lines that delegates to the sub-modules. Run cargo test — all existing RPC calls must return identical responses.
R36 — Split mesh/listener.rs into submodules: Create core/archipelago/src/mesh/listener/ directory. Extract: (1) session.rs — run_mesh_session() loop. (2) frames.rs — handle_frame() dispatcher. (3) identity.rs — handle_identity_received(), handle_typed_message(). (4) sync.rs — sync_queued_messages(), store_typed_message(). (5) bitcoin.rs — Bitcoin relay RPC operations. Keep mod.rs as the entry point with spawn_mesh_listener(). No file should exceed 500 lines. Run cargo test.
R37 — Split rpc/mod.rs into submodules: Extract: (1) dispatcher.rs — method name → handler routing match statement. (2) middleware.rs — CSRF validation, session checking, rate limiting logic. (3) response.rs — response building, error formatting. Keep mod.rs as the thin entry point that wires everything together. No file > 500 lines. Run cargo test.
R38 — Split lnd.rs into submodules: Create api/rpc/lnd/ directory. Extract: (1) wallet.rs — balance, send, receive, invoices. (2) channels.rs — open, close, list channels. (3) info.rs — node info, network info, connection strings. (4) payments.rs — payment history, routing. No file > 500 lines. Run cargo test.
Phase 10 verification gate: Run cargo clippy --all-targets --all-features — zero warnings. Run cargo test --all-features — all pass. Check file sizes: find core/archipelago/src/ -name '*.rs' -exec wc -l {} + | sort -rn | head -20 — no file should exceed 600 lines (allowing some margin). Deploy to .198 and verify all RPC endpoints work.

Phase 11: Frontend Architecture — Split God Components (Part 1)

F25 — Split Web5.vue (Part 1: Router shell + Identity): In neode-ui/src/router/index.ts, add nested routes under the Web5 route: { path: 'identity', component: () => import('@/views/web5/Web5Identity.vue') }, etc. Create neode-ui/src/views/web5/Web5.vue as a layout shell (~150 lines) with <router-view /> for sub-views. Extract the DID management section into Web5Identity.vue. Ensure the route transition works smoothly. Run npm run type-check.
F25 — Split Web5.vue (Part 2: Extract remaining sections): Create Web5Wallet.vue (wallet operations), Web5Nostr.vue (Nostr relays/profiles), Web5Credentials.vue (Verifiable Credentials), Web5Peers.vue (P2P federation), Web5Storage.vue (DWN storage/explorer), Web5Goals.vue (goals/voting), Web5Marketplace.vue (decentralized marketplace). Each should be under 500 lines. Move shared state to composables if needed (e.g., useWeb5Identity()). Run npm run type-check and npm test.
F26 — Split Mesh.vue into submodules: Create views/mesh/Mesh.vue as layout with tabs. Extract: MeshRadio.vue (radio status, device connection), MeshChat.vue (chat interface, messages), MeshNetwork.vue (topology, peers), MeshFederation.vue (federation sync). Add nested routes. No component > 500 lines. Run npm run type-check.
F27 — Split Dashboard.vue into submodules: Create views/dashboard/Dashboard.vue as sidebar + router-view shell. Extract: DashboardHome.vue (overview cards), DashboardApps.vue (running apps, quick actions), DashboardSystem.vue (CPU/RAM/disk stats). Run npm run type-check.
F28 — Split Settings.vue into submodules: Create views/settings/Settings.vue as tab navigation shell. Extract: SettingsAccount.vue (password, 2FA, sessions), SettingsSystem.vue (server name, reboot, updates), SettingsNetwork.vue (Tor, Tailscale), SettingsAppearance.vue (theme, screensaver). Run npm run type-check.
Phase 11 verification gate: Run cd neode-ui && npm run type-check — zero errors. Run npm test — all pass. Check component sizes: find neode-ui/src/views -name '*.vue' -exec wc -l {} + | sort -rn | head -20 — no component should exceed 600 lines. Deploy to .198 and navigate every section of Web5, Mesh, Dashboard, Settings.
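
The layout-shell-plus-nested-routes shape used in F25 through F28 can be sketched as a plain route tree. This is only an illustration: `RouteSketch` is a local stand-in for vue-router's route record type, the `/web5` mount path is an assumption, and every child path other than `identity` is hypothetical. The view file names follow the ones listed in F25.

```typescript
// Minimal stand-in for vue-router's route record so the sketch runs without Vue.
interface RouteSketch {
  path: string;
  // File the real route would lazy-load, e.g. () => import('@/views/web5/Web5Identity.vue').
  view: string;
  children?: RouteSketch[];
}

// Assumption: the Web5 section is mounted at "/web5". Only the "identity"
// child path appears in the plan; the remaining child paths are hypothetical.
const web5Route: RouteSketch = {
  path: "/web5",
  view: "@/views/web5/Web5.vue", // layout shell that renders <router-view />
  children: [
    { path: "identity", view: "@/views/web5/Web5Identity.vue" },
    { path: "wallet", view: "@/views/web5/Web5Wallet.vue" },
    { path: "nostr", view: "@/views/web5/Web5Nostr.vue" },
    { path: "credentials", view: "@/views/web5/Web5Credentials.vue" },
    { path: "peers", view: "@/views/web5/Web5Peers.vue" },
    { path: "storage", view: "@/views/web5/Web5Storage.vue" },
    { path: "goals", view: "@/views/web5/Web5Goals.vue" },
    { path: "marketplace", view: "@/views/web5/Web5Marketplace.vue" },
  ],
};

// Expand the tree into the full URL paths the router would serve.
// Relative child paths are joined onto the parent's full path.
function resolvePaths(route: RouteSketch, base: string = ""): string[] {
  const full = route.path.charAt(0) === "/" ? route.path : base + "/" + route.path;
  let paths: string[] = [full];
  const children = route.children || [];
  for (let i = 0; i < children.length; i++) {
    paths = paths.concat(resolvePaths(children[i], full));
  }
  return paths;
}
```

Listing the resolved paths this way gives a quick checklist for the "navigate every section" step of the gate: each entry should render inside the shell without a full page reload.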

Phase 12: Frontend Architecture — Split God Components (Part 2)

F29+F30+F31+F32 — Split remaining large views: Split Marketplace.vue (1,293 lines) into marketplace/MarketplaceGrid.vue, MarketplaceFilters.vue, MarketplaceInstall.vue. Split Server.vue (1,132 lines) into server/ServerOverview.vue, ServerContainers.vue, ServerLogs.vue. Split Home.vue (1,059 lines) into home/HomeOverview.vue, HomeApps.vue, HomeStatus.vue. Split AppDetails.vue (1,036 lines) into app/AppOverview.vue, AppLogs.vue, AppConfig.vue. Run npm run type-check after each split.
F33 — Decompose useAppStore into focused stores: Create: stores/auth.ts (login, logout, session, password, TOTP — ~100 lines), stores/server.ts (server info, stats, reboot/shutdown — ~80 lines), stores/realtime.ts (WebSocket connection, subscriptions, heartbeat — ~80 lines), stores/packages.ts (package install/uninstall, marketplace — ~80 lines). Keep stores/app.ts as a thin re-export: export { useAuthStore } from './auth'; export { useServerStore } from './server'; ... plus a useAppStore() function that returns a composed object for backward compatibility. Run npm run type-check and npm test.
F20+F21+F22+F23+F24 — Remaining P3 frontend fixes: (1) Dashboard.vue: add aria-current="page" to active RouterLink. (2) Apps.vue: debounce search input (150ms) and memoize lowercase strings. (3) style.css: add @media (max-width: 768px) { .glass-card, .glass-button { backdrop-filter: blur(8px); } } to reduce mobile GPU load. (4) types/api.ts: replace Record<string, unknown> for DID operations with branded types. (5) websocket.ts: track checkInterval and clear in all paths. Run npm run type-check.
Phase 12 verification gate: Run cd neode-ui && npm run type-check — zero errors. Run npm test — all pass. find neode-ui/src/views -name '*.vue' -exec wc -l {} + | sort -rn | head -10 — no component > 600 lines. wc -l neode-ui/src/stores/app.ts — should be under 100 lines (thin re-export). Deploy to .198 and navigate all views.
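
The backward-compatible `useAppStore()` from F33 could look roughly like the sketch below. It uses plain objects in place of Pinia stores so it runs on its own; the store fields (`isAuthenticated`, `name`) and the method names are hypothetical and stand in for whatever the real stores expose. The point it illustrates is that the composed object should delegate reads through getters, since copying state with an object spread would hand callers a stale snapshot.

```typescript
// Stand-ins for the focused stores. In the real codebase each would be a
// Pinia store in its own file (stores/auth.ts, stores/server.ts, ...).
interface AuthStore {
  isAuthenticated: boolean;
  login(password: string): void;
  logout(): void;
}

interface ServerStore {
  name: string;
  rename(next: string): void;
}

let authStore: AuthStore | undefined;
let serverStore: ServerStore | undefined;

// Hypothetical auth store: any non-empty password "logs in".
function useAuthStore(): AuthStore {
  if (!authStore) {
    const store: AuthStore = {
      isAuthenticated: false,
      login(password: string) { store.isAuthenticated = password.length > 0; },
      logout() { store.isAuthenticated = false; },
    };
    authStore = store;
  }
  return authStore;
}

// Hypothetical server store holding only a display name.
function useServerStore(): ServerStore {
  if (!serverStore) {
    const store: ServerStore = {
      name: "archipelago",
      rename(next: string) { store.name = next; },
    };
    serverStore = store;
  }
  return serverStore;
}

// Compatibility facade: old call sites keep using useAppStore(), while all
// state lives in the focused stores. Getters forward every read, so the
// facade never holds its own copy of the state.
function useAppStore() {
  const auth = useAuthStore();
  const server = useServerStore();
  return {
    get isAuthenticated(): boolean { return auth.isAuthenticated; },
    get serverName(): string { return server.name; },
    login: auth.login,
    logout: auth.logout,
    renameServer: server.rename,
  };
}
```

With this shape, a component that still calls `useAppStore().login(...)` and a component migrated to `useAuthStore()` observe the same state, which is what lets the migration proceed one call site at a time.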

Phase 13: Script Architecture — Shared Library + Splits

S21 — Create shared script library: Create scripts/lib/common.sh with functions extracted from duplicated patterns: log_info(), log_warn(), log_error() (colored logging), ssh_cmd() (SSH wrapper with key), wait_for_health() (health poll loop), check_disk_space(), mem_limit() (memory limit calculator). Source it from deploy-to-target.sh, first-boot-containers.sh, deploy-tailscale.sh. Run each script with --dry-run or --help to verify sourcing works.
S18 — Split deploy-to-target.sh (Part 1): Create scripts/deploy/frontend.sh — extract frontend build + sync logic. Create scripts/deploy/backend.sh — extract backend build + sync logic. Keep deploy-to-target.sh as orchestrator that sources lib/common.sh and calls the sub-scripts. Target: orchestrator < 400 lines, each sub-script < 300 lines. Test with ./scripts/deploy-to-target.sh --dry-run --target 192.168.1.198.
S18 — Split deploy-to-target.sh (Part 2): Extract scripts/deploy/configs.sh (nginx, systemd, script sync), scripts/deploy/containers.sh (container creation/update), scripts/deploy/verify.sh (post-deploy health checks), scripts/deploy/rollback.sh (rollback on failure). No file > 400 lines.
S19 — Split build-auto-installer-iso.sh: Create image-recipe/build/capture-images.sh, build/create-rootfs.sh, build/install-packages.sh, build/bundle-configs.sh, build/package-iso.sh. Keep orchestrator under 300 lines.
S20 — Split first-boot-containers.sh: Create scripts/first-boot/databases.sh (MariaDB, PostgreSQL, Redis), first-boot/bitcoin.sh (Bitcoin Knots, ElectrumX), first-boot/lightning.sh (LND, BTCPay), first-boot/apps.sh (Nextcloud, Jellyfin, etc.), first-boot/networking.sh (Tor, Tailscale). Each sources lib/common.sh. No file > 300 lines.
S16 — Make ISO builds reproducible: Create scripts/image-versions.env with pinned digests: BITCOIN_IMAGE="docker.io/bitcoinknots/bitcoin:v28.1@sha256:...". Source this in build-auto-installer-iso.sh. Never fall back to :latest. Add a manifest file to ISO output recording exact image digests shipped.
Phase 13 verification gate: wc -l scripts/deploy-to-target.sh should report fewer than 400 lines, wc -l scripts/first-boot-containers.sh fewer than 300, and wc -l image-recipe/build-auto-installer-iso.sh fewer than 300. grep -rn ':latest' scripts/ image-recipe/ | grep -v node_modules | grep -v '#' | grep -v '\.md' should return zero results. Test deploy: ./scripts/deploy-to-target.sh --dry-run --target 192.168.1.198 should succeed.
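
The digest-pinning rule from S16 can also be checked mechanically rather than by grep alone. The sketch below is one possible check, not part of the plan: it assumes `scripts/image-versions.env` contains `NAME_IMAGE="registry/repo:tag@sha256:..."` lines like the `BITCOIN_IMAGE` example, and the function and type names are hypothetical.

```typescript
interface PinIssue {
  variable: string;
  image: string;
  reason: string;
}

// A pinned reference ends with "@sha256:" followed by 64 lowercase hex digits.
const DIGEST_PATTERN = /@sha256:[0-9a-f]{64}$/;

// Scan the text of an env file and report every *_IMAGE variable that is not
// pinned to a digest or that uses the floating :latest tag.
function findUnpinnedImages(envFile: string): PinIssue[] {
  const issues: PinIssue[] = [];
  const lines = envFile.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    // Skip blank lines and comments.
    if (line === "" || line.charAt(0) === "#") continue;
    // Assumption: image variables end in _IMAGE; quotes around the value are optional.
    const match = /^([A-Z0-9_]+_IMAGE)=["']?([^"']+)["']?$/.exec(line);
    if (!match) continue;
    const variable = match[1];
    const image = match[2];
    if (/:latest(@|$)/.test(image)) {
      issues.push({ variable: variable, image: image, reason: "uses the floating :latest tag" });
    } else if (!DIGEST_PATTERN.test(image)) {
      issues.push({ variable: variable, image: image, reason: "missing @sha256 digest" });
    }
  }
  return issues;
}
```

An empty result means every image variable is pinned. Unlike the grep pipeline, this also catches a versioned tag that has no digest, which is the case most likely to slip through when a new image is added.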

Phase 14: Integration Tests

Backend integration tests (Part 1): Create core/archipelago/tests/test_auth_flow.rs — test login → session → CSRF → authenticated request → logout. Create test_rpc_validation.rs — test every public endpoint with invalid input → proper error code. Create test_session_persist.rs — create session → simulate restart → session survives. Create test_rate_limiting.rs — flood endpoint → 429 → wait → allowed. Run cargo test --all-features.
Backend integration tests (Part 2): Create test_container_lifecycle.rs — install → start → health → stop → uninstall (mock Podman). Create test_backup_restore.rs — create backup → verify integrity → restore to staging → validate. Create test_health_endpoint.rs — healthy → degraded → recovery transitions. Target: 25+ tests passing.
Frontend integration tests: Create neode-ui/src/__tests__/integration/auth-flow.spec.ts — login → dashboard → timeout → redirect. Create app-lifecycle.spec.ts — marketplace → install → progress → launch → uninstall. Create websocket.spec.ts — connect → update → disconnect → reconnect → state consistent. Create error-handling.spec.ts — network error → toast → retry → success. Create settings-flow.spec.ts — password change → re-login → 2FA setup. Target: 20+ tests passing. Run npm test.
E2E smoke test script: Create scripts/smoke-test.sh that runs against .198. Tests: (1) curl /health → OK. (2) Login via RPC → get session. (3) server.get-info → valid JSON. (4) container.list → success. (5) Check every /app/* proxy responds. (6) Check WebSocket upgrade (101). (7) Check Tor hidden service if available. Exit 0 only if all pass. Make executable. Run against .198.
Phase 14 verification gate: cargo test --all-features should show 25+ tests passing. cargo test --all-features 2>&1 | grep 'test result' prints one summary line per test binary, so add up the passed counts across those lines. cd neode-ui && npm test -- --reporter=verbose 2>&1 | grep -c 'PASS\|✓' should report 20+ tests. ./scripts/smoke-test.sh 192.168.1.198 should exit 0.
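
Because `cargo test` prints a separate `test result:` line for each test binary, the backend count in the gate is a sum over several lines. A minimal sketch of that tally follows; the function names are hypothetical, and it assumes the standard summary format `test result: ok. N passed; M failed; ...`.

```typescript
interface TestTotals {
  passed: number;
  failed: number;
}

// Sum the passed and failed counts across every "test result:" summary line
// found in captured cargo test output.
function sumCargoTestResults(output: string): TestTotals {
  const pattern = /^test result: \w+\. (\d+) passed; (\d+) failed;/gm;
  const totals: TestTotals = { passed: 0, failed: 0 };
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    totals.passed += parseInt(match[1], 10);
    totals.failed += parseInt(match[2], 10);
  }
  return totals;
}

// The gate passes only when nothing failed and the total reaches the target.
function meetsGate(totals: TestTotals, minimum: number = 25): boolean {
  return totals.failed === 0 && totals.passed >= minimum;
}
```

Requiring zero failures alongside the minimum keeps a run with many passing tests and one failing test from satisfying the gate on count alone.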

Phase 15: Type Sync + CI/CD Documentation

Rust→TypeScript type generation: Add ts-rs = "10" to core/models/Cargo.toml (verify it exists first with cargo search ts-rs). Add #[derive(TS)] and #[ts(export)] to all API request/response structs in core/models/src/. Create a build script or test that generates TypeScript to neode-ui/src/types/generated.ts. Replace manual types in neode-ui/src/types/api.ts with imports from generated.ts where applicable. Run both cargo test and npm run type-check to verify.
Document CI/CD pipeline plan: Create docs/ci-cd-plan.md documenting the planned GitHub Actions CI/CD setup. Include: (1) CI workflow (triggers: push to main + PRs; jobs: cargo clippy, cargo fmt --check, cargo test, npm type-check, npm lint, npm test; merge policy: all checks must pass). (2) Release workflow (triggers: tag push v*; jobs: Linux binary cross-compile, frontend build, ISO build via SSH, QEMU verification). (3) Pre-requisites list (GitHub Actions runners, Rust toolchain, SSH key for build server, branch protection rules, image digest manifest). (4) Estimated implementation time: 2 weeks. This is documentation only — do not implement CI/CD yet.
Final verification sweep: Run ALL verification gates from every phase. Deploy to .198. Run smoke test. Verify: no Rust file > 500 lines, no Vue component > 500 lines, no script > 400 lines, no store > 1 responsibility, zero unwraps in production, zero :latest tags, zero sudo podman, zero blocking I/O in async, zero TODO comments. Document results.
Update architecture docs: Update docs/architecture-review.html tech debt map and quality scores to reflect all completed work. Update docs/architecture.md codebase stats. Update docs/BETA-PROGRESS.md with completion status. Commit with docs: update architecture review with completed refactoring.
