feat(reconcile): add --create-missing flag for recovering from failed-update rollbacks

Context: when a package update fails after remove-old-container but
before reconcile-recreate, the rollback path in update.rs tries to
restart the old container by name. If the container is already gone
(removed in step 3 of the update), rollback fails silently and the node
is left with no live container for that app, although its on-disk data
is still intact. This is exactly the state .228 ended up in after the
reconcile-script-missing bug killed bitcoin-knots and lnd.

For optional apps (SPEC_OPTIONAL=true), reconcile was designed to repair
only containers that already exist: it skips "not installed" entries on
the assumption that the install RPC creates them. That safety check is
correct for normal operation but blocks recovery when an optional-marked
container has been destroyed by a failed update.

Fix: add a --create-missing flag that overrides the SPEC_OPTIONAL skip.
When set, reconcile treats absent containers exactly like broken ones:
it creates them from the canonical spec using the existing on-disk data
directory. The override is narrow in scope; the default behaviour is
unchanged. Updated --help to document all five flags.

Verified on .228: after the failed bitcoin-core update took out both
bitcoin-knots and lnd, running

    reconcile --container=bitcoin-knots --create-missing --force

as the archipelago user (not root, because podman is rootless) brought
bitcoin-knots back using the pruned chainstate at
/var/lib/archipelago/bitcoin. Repeated for lnd. All containers are now
running; electrumx is reconnecting; the UIs are recovering.

Does NOT fix the underlying update-flow rollback hole (rollback should
be able to re-create a container from spec, not just restart it by
name). That is a separate commit. This flag is the manual recovery tool
and the primitive the improved rollback will call.
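
The skip decision described above can be sketched as a small truth table. This is an illustration only: `decide` and its three string arguments are hypothetical stand-ins for SPEC_OPTIONAL, the result of container_exists, and CREATE_MISSING, and the sketch does not touch podman or the real reconciler.

```shell
#!/usr/bin/env bash
# Hypothetical model of the reconciler's skip decision; not the real script.
# Arguments: <spec_optional> <container_exists> <create_missing>,
# each the literal string "true" or "false".
decide() {
    local optional="$1" exists="$2" create_missing="$3"
    # Skip only when the app is optional, its container is absent,
    # and the caller has not asked for missing containers to be created.
    if [ "$optional" = "true" ] && [ "$exists" = "false" ] \
        && [ "$create_missing" = "false" ]; then
        echo "skip"
    else
        echo "reconcile"
    fi
}

decide true  false false   # optional, absent, default flags      -> skip
decide true  false true    # optional, absent, --create-missing   -> reconcile
decide true  true  false   # optional, present                    -> reconcile
decide false false false   # non-optional, absent                 -> reconcile
```

Only the first combination is skipped, so the flag changes the outcome in exactly one case: an optional app whose container is gone.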
@@ -18,16 +18,25 @@ SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
 # ── Parse arguments ──────────────────────────────────────────────────
 CHECK_ONLY=false
 FORCE=false
+CREATE_MISSING=false
 FILTER_TIER=""
 FILTER_CONTAINER=""
 for arg in "$@"; do
     case "$arg" in
         --check-only) CHECK_ONLY=true ;;
         --force) FORCE=true ;;
+        --create-missing) CREATE_MISSING=true ;;
         --tier=*) FILTER_TIER="${arg#*=}" ;;
         --container=*) FILTER_CONTAINER="${arg#*=}" ;;
         -h|--help)
-            echo "Usage: $0 [--check-only] [--force] [--tier=N] [--container=NAME]"
+            echo "Usage: $0 [--check-only] [--force] [--create-missing] [--tier=N] [--container=NAME]"
+            echo ""
+            echo "  --check-only       Audit only, no changes."
+            echo "  --force            Override user-stopped state."
+            echo "  --create-missing   Override SPEC_OPTIONAL for containers that have on-disk"
+            echo "                     data but no live container (recovery from failed updates)."
+            echo "  --tier=N           Only reconcile containers in tier N."
+            echo "  --container=NAME   Only reconcile the named container (spec key)."
             exit 0 ;;
     esac
 done
@@ -213,7 +222,9 @@ reconcile() {
 
     # Optional apps: only reconcile if already installed (container exists).
     # The install RPC creates the container; the reconciler just keeps it running.
-    if [ "$SPEC_OPTIONAL" = "true" ] && ! container_exists "$name"; then
+    # --create-missing overrides this so we can recover from failed-update rollbacks
+    # that deleted a container without restoring it (on-disk data still present).
+    if [ "$SPEC_OPTIONAL" = "true" ] && ! container_exists "$name" && ! $CREATE_MISSING; then
         skip "$name — not installed"
         COUNT_SKIPPED=$((COUNT_SKIPPED + 1))
         return
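
To see how the recovery invocation maps onto the script's variables, the argument loop from the diff can be run in isolation. This is a minimal sketch: `parse_flags` is a hypothetical wrapper introduced here so the loop can be called more than once, and the help branch is left out; the variable names and case patterns are copied from the diff.

```shell
#!/usr/bin/env bash
# Sketch only: parse_flags is a hypothetical wrapper around the loop in the diff.
parse_flags() {
    # Reset to the script's defaults before each parse.
    CHECK_ONLY=false
    FORCE=false
    CREATE_MISSING=false
    FILTER_TIER=""
    FILTER_CONTAINER=""
    local arg
    for arg in "$@"; do
        case "$arg" in
            --check-only)     CHECK_ONLY=true ;;
            --force)          FORCE=true ;;
            --create-missing) CREATE_MISSING=true ;;
            # ${arg#*=} strips everything up to and including the first "=".
            --tier=*)         FILTER_TIER="${arg#*=}" ;;
            --container=*)    FILTER_CONTAINER="${arg#*=}" ;;
        esac
    done
}

# The flags used for the bitcoin-knots recovery described in the commit message.
parse_flags --container=bitcoin-knots --create-missing --force
echo "container=$FILTER_CONTAINER create_missing=$CREATE_MISSING force=$FORCE"

# `! $CREATE_MISSING` runs the variable's value as a command, so it relies on
# the value being exactly "true" or "false" (both are shell builtins).
if ! $CREATE_MISSING; then
    echo "absent optional containers would be skipped"
else
    echo "absent optional containers would be created"
fi
```

Because the guard executes the variable's value, it is only safe while CREATE_MISSING is assigned the literals `true` and `false` and nothing else, which holds for the loop shown in the diff.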