fix: overhaul container lifecycle — recovery, health, uninstall, UI state

Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00
parent cdff10a8bc
commit 64b57dca7d
65 changed files with 3950 additions and 298 deletions
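
Reviewer note: the exit-code rules in the "UI state representation" section of the commit message can be read as a simple three-way match. The sketch below is only an illustration of that reading, not code from this commit. The names ExitDisplay, classify_exit and label are hypothetical; the only identifier the commit message confirms is the exit_code field on PackageDataEntry. Exit code 137 is 128 + 9 (SIGKILL); the commit treats it as an OOM kill, and the sketch follows that labeling.

```rust
/// Hypothetical display state for a container that has exited.
/// (Illustrative only; the real UI types are not shown in this diff.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitDisplay {
    /// Exit code 0: shown as "stopped" (gray).
    Stopped,
    /// Exit code 137 (128 + SIGKILL): shown as "killed (OOM)".
    KilledOom,
    /// Any other non-zero exit code: shown as "crashed" (red).
    Crashed,
}

/// Map a container exit code to a display state, following the
/// three rules listed in the commit message.
pub fn classify_exit(exit_code: i32) -> ExitDisplay {
    match exit_code {
        0 => ExitDisplay::Stopped,
        137 => ExitDisplay::KilledOom,
        _ => ExitDisplay::Crashed,
    }
}

/// Human-readable label for a display state, using the strings
/// quoted in the commit message.
pub fn label(state: ExitDisplay) -> &'static str {
    match state {
        ExitDisplay::Stopped => "stopped",
        ExitDisplay::KilledOom => "killed (OOM)",
        ExitDisplay::Crashed => "crashed",
    }
}
```

The 137 arm has to come before the catch-all arm, otherwise an OOM kill would be reported as a generic crash.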
--- a/core/archipelago/src/monitoring/mod.rs
+++ b/core/archipelago/src/monitoring/mod.rs
@@ -14,18 +14,21 @@ use std::path::PathBuf;
 use std::sync::Arc;
 use tracing::{debug, warn};

-/// Spawn the background metrics collector (runs every 60 seconds).
+/// Spawn the background metrics collector (runs every 300 seconds / 5 minutes).
 /// Evaluates alert rules on each snapshot and dispatches notifications.
+/// Note: health_monitor.rs handles container state polling at 120s intervals.
+/// This collector handles system-level metrics (CPU, disk, network) and only
+/// calls podman stats every 5 minutes to avoid duplicate subprocess overhead.
 pub fn spawn_metrics_collector(
    store: Arc<MetricsStore>,
    state: Option<Arc<crate::state::StateManager>>,
    data_dir: Option<PathBuf>,
 ) {
    tokio::spawn(async move {
-        // Wait 30s for system to stabilize after boot
-        tokio::time::sleep(std::time::Duration::from_secs(30)).await;
+        // Wait 60s for system to stabilize after boot
+        tokio::time::sleep(std::time::Duration::from_secs(60)).await;

-        let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
+        let mut interval = tokio::time::interval(std::time::Duration::from_secs(300));

        loop {
            interval.tick().await;