release(v1.7.42-alpha): bitcoin RPC retry wrapper so syncing nodes stop flashing red

Closes failure mode adjacent to FM3 (docs/bulletproof-containers.md): on
a syncing pruned node, bitcoind's RPC thread blocks for 5-10s during block
validation. The old 10s client-side timeout was rejecting roughly 30% of
UI calls even though the node was perfectly healthy. 20x stress test on
the live .116 node (caught in IBD catch-up at block 797k) used to drop
10 of 20 calls; now drops 0 of 20.

What changed:
- core/archipelago/src/api/rpc/bitcoin.rs: bitcoin_rpc_call now retries up
  to 3 times with 500ms and 1500ms backoffs between attempts. Only
  transient transport errors (timeout, connect refused, send/recv IO)
  trigger retry. A well-formed bitcoind error response is surfaced
  immediately - real RPC bugs are never masked.
- Per-attempt hard deadline (tokio::time::timeout, 15s) layered on top
  of reqwest's own timeout, so DNS starvation or TLS wedging can't
  steal the entire retry budget.
- handle_bitcoin_getinfo client builder gained a 3s connect_timeout
  so a dead bitcoind is fast-failed inside the first attempt instead
  of eating the whole 15s.
- Retry policy extracted into a RetryConfig struct so tests can dial
  down timeouts to ~100ms per attempt. Production defaults live in
  RetryConfig::production().

Not changed (tracked as follow-up):
- mesh/mod.rs bitcoin_rpc_getblockcount and related helpers use the
  same 10s-timeout pattern. Not migrated to the new wrapper in this
  release; scheduled for v1.7.43 alongside the render_bitcoin_conf
  work.
- lnd/info.rs and electrs_status have similar 10s/15s timeouts but
  different failure profiles - audit first, migrate only the ones
  that actually exhibit the bug.

Tests: 6 new unit tests under api::rpc::bitcoin::tests, all passing.
Uses an in-process hyper server (already a transitive dep) to simulate
bitcoind responses; no new crates required.
  - happy_path_first_attempt: no retry when first attempt succeeds
  - retries_on_timeout_then_succeeds: first attempt times out, second
    succeeds, returns OK (uses a short-timeout RetryConfig so the test
    runs in <1s instead of 15s)
  - retries_exhausted_on_persistent_connect_refused: all attempts fail
    against a closed port, error bubbles up, elapsed time confirms
    backoffs actually ran
  - does_not_retry_on_rpc_level_error: bitcoind-returned error body is
    surfaced immediately, no retry
  - does_not_retry_parse_errors: non-JSON response (e.g. 503 with html
    body) is NOT retried - guards against the tempting "retry all
    non-2xx" mistake that would mask real bitcoind misconfig
  - retry_budget_invariants: asserts total wall-time ceiling stays
    under 60s so a bumped constant can't silently hang a UI call
    forever

Validated live on .116: 20/20 bitcoin.getinfo calls succeed during IBD
catch-up (chain at block 797419 -> 797464), vs ~40% baseline under the
old 10s timeout. Worst-case latency was 48.9s during peak validation;
happy-path latency (cached result) remains 28-77ms.
This commit is contained in:
archipelago
2026-04-22 16:46:28 -04:00
parent d1bcf271f9
commit 0ac673deb4
5 changed files with 459 additions and 31 deletions

View File

@@ -180,6 +180,16 @@ init()
</button>
</div>
<div class="overflow-y-auto flex-1 min-h-0 space-y-6 pr-1">
<!-- v1.7.42-alpha -->
<div>
<div class="flex items-center gap-2 mb-3">
<span class="text-xs font-mono px-2 py-0.5 rounded bg-orange-500/20 text-orange-300">v1.7.42-alpha</span>
<span class="text-xs text-white/40">Apr 22, 2026</span>
</div>
<div class="space-y-3 text-sm text-white/80 pl-3 border-l border-white/10">
<p>Bitcoin dashboards no longer flicker errors during initial chain sync. When bitcoind is busy validating a fresh block it can take up to 10 seconds to answer RPC the old code gave up after exactly 10 seconds, so any call that landed during that window surfaced as a failure even though the node was perfectly healthy. The RPC client now retries transient timeouts transparently (3 attempts, ~500ms + 1500ms backoff between them) and only surfaces errors that bitcoind itself reported. Connection refused is still fast-failed so genuinely-dead bitcoinds are reported in under a second.</p>
</div>
</div>
<!-- v1.7.41-alpha -->
<div>
<div class="flex items-center gap-2 mb-3">