Validator Load Test Report (2026-02-07): Overnight Finalization Gap

Context

This report documents the overnight behavior after an ultra load test run was started on 2026-02-06 and left running while monitoring continued.

Primary concern:

very high verified volume
much lower finalized volume
persistent backlog that did not clear overnight

Observed live state (captured around 2026-02-07T03:46Z):

totalVerifiedCount=81166
totalFinalizedCount=10592
gap: 70574
batchQueueSize=25385
backpressurePendingCount=14482

Test Artifacts

Main run logs:
- /Users/mhess/oak-chain/logs/ultra-30min-20260206-220352.log
- /Users/mhess/oak-chain/logs/ultra-30min-20260206-221234.log
Client logs:
- /Users/mhess/oak-chain/logs/ultra-30min-client-*-20260206-220352.log
- /Users/mhess/oak-chain/logs/ultra-30min-client-*-20260206-221234.log
Signal capture:
- /Users/mhess/oak-chain/logs/limit-signature-20260206-220105.csv
Validator logs:
- /Users/mhess/oak-chain/logs/validator-0.log
Observer captures:
- /Users/mhess/oak-chain/postmortem/load-bottleneck-*

High-Level Outcome

The cluster entered and remained in an over-limit debt state:

ingest and verification progressed
finalization lagged far behind
sender/backpressure path repeatedly timed out and re-queued batches
backlog plateaued instead of draining

This is not a verifier rejection event and not a replication lag event.

Key Evidence

1. Runs were interrupted before full 30 minutes

From main logs:

run 220352 cleaned up at 22:09:38
run 221234 cleaned up at 22:26:57

Implication:

not a completed benchmark run
observed debt reflects interrupted overload sessions

2. Rate limiting was active during ingest

validator-0.log contains large volumes of:

Rate limit exceeded for client: 127.0.0.1

Count in validator-0.log:

442840 rate-limit log entries

Implication:

local multi-client load collapsed to same client identity (127.0.0.1) from limiter perspective
429 pressure was present while backlog formed

3. No file descriptor crash in this overnight run

validator-0.log:

Too many open files: 0 matches

Implication:

this event is a throughput/finalization bottleneck, not an FD-exhaustion recurrence

4. Signal capture confirms widening debt during active window

From limit-signature-20260206-220105.csv:

active-window verify rate: ~30.57/s
active-window finalize rate: ~5.65/s
max queue_pending: 27074
max backpressure_pending: 10550
max gap (verified_minus_finalized): 30175
tail (last ~58s in capture): queue slightly down but backpressure and gap still rising

Implication:

system was still accumulating finalize debt near end of capture window

5. Backpressure path repeatedly timed out for hours

Repeated entries in validator-0.log:

Backpressure timeout: pending=14482, max=10000, waited=30000ms
Backpressure timeout - re-queuing batch (...)

Live queue stats remained pinned:

BackpressureManager[sent=14483, acked=1, pending=14482, max=10000, active=true]

Implication:

downstream sender/ack/finalizer path is the hard bottleneck
queue debt can become quasi-static without further ingest

Root Cause Assessment

Primary bottleneck:

downstream batch send + ack/finalization throughput

Not primary in this event:

verifier input queue (mempool was clear)
replication lag (maxLagMs=0)
durability timeout (pendingAcks=0 in OPS overview)

Operational nuance:

queue/finalization counters are leader-centric
OPS overview vs signals showed inconsistent verifier wait values in some windows, reducing trust in real-time severity display

Why Verified Can Be High While Finalized Stays Low

Pipeline behavior under overload:

writes are accepted and verified quickly enough
verified proposals enter batch/sender path
sender hits backpressure limit and timeout loop
batches are re-queued repeatedly
finalization progresses very slowly, so gap widens and persists

Remediation Plan (Operations + Tuning)

Phase 1: Stop debt plateauing

Tune these together (single-variable changes are misleading):

oak.proposal.batch.max
oak.proposal.finalization.chunk.size
oak.consensus.max.pending.messages
oak.consensus.backpressure.timeout.ms

Goal:

eliminate repeated pending=14482 timeout loop
achieve sustained negative slope on:
- batchQueueSize
- backpressurePendingCount
- verified-finalized gap

Phase 2: Benchmark with debt-aware pass/fail

A run is pass only if:

target throughput achieved
gap bounded during steady state
backlog drains after load stop within recovery window

Do not grade success by ingest/verified counts alone.

Phase 3: Improve observability truth

enforce parity between OPS overview and OPS signals for verifier metrics
keep leader attribution explicit in run summaries

Immediate Operator Actions

Before next ultra run:

confirm clean baseline (batchQueueSize=0, backpressurePendingCount=0, low gap)

During run:

capture OPS + leader queue stats every 5s
capture jstack series during first overload onset

After run:

require a recovery observation window (no load) and measure drain rate
fail run if backlog plateaus

Success Criteria

“Everything finalizes” means:

no long-lived backlog plateau
no repeated backpressure timeout loop at a fixed pending count
verified/finalized gap returns near baseline after load removal
this behavior is repeatable across multiple runs

Bottom Line

The overnight event confirms a real downstream finalization bottleneck. The system can accept and verify large volume, but finalization is constrained by sender/backpressure behavior and can stall into persistent debt. Any readiness claim must require debt drain, not just ingest.

Validator Load Test Report (2026-02-07): Overnight Finalization Gap ​

Context ​

Test Artifacts ​

High-Level Outcome ​

Key Evidence ​

1. Runs were interrupted before full 30 minutes ​

2. Rate limiting was active during ingest ​

3. No file descriptor crash in this overnight run ​

4. Signal capture confirms widening debt during active window ​

5. Backpressure path repeatedly timed out for hours ​

Root Cause Assessment ​

Why Verified Can Be High While Finalized Stays Low ​

Remediation Plan (Operations + Tuning) ​

Phase 1: Stop debt plateauing ​

Phase 2: Benchmark with debt-aware pass/fail ​

Phase 3: Improve observability truth ​

Immediate Operator Actions ​

Success Criteria ​

Bottom Line ​

Validator Load Test Report (2026-02-07): Overnight Finalization Gap

Context

Test Artifacts

High-Level Outcome

Key Evidence

1. Runs were interrupted before full 30 minutes

2. Rate limiting was active during ingest

3. No file descriptor crash in this overnight run

4. Signal capture confirms widening debt during active window

5. Backpressure path repeatedly timed out for hours

Root Cause Assessment

Why Verified Can Be High While Finalized Stays Low

Remediation Plan (Operations + Tuning)

Phase 1: Stop debt plateauing

Phase 2: Benchmark with debt-aware pass/fail

Phase 3: Improve observability truth

Immediate Operator Actions

Success Criteria

Bottom Line