Operator Health: Preset-Aware Supervision Policy

The health system classifies running loops into buckets using preset-specific thresholds rather than a single global timeout. This allows long-running presets (like autospec) to remain healthy while short-running presets (like autosimplify) escalate quickly.

Health Buckets

| Bucket | Meaning |
| --- | --- |
| Active | Running and recently updated; no concern. |
| Watching | Quiet longer than the preset's warning threshold but not yet stuck. Investigate soon. |
| Stuck | Quiet longer than the preset's stuck threshold. Likely needs intervention. |
| Failed | Failed or timed out within the last 24 hours. |
| Completed | Completed within the last 24 hours (shown with --verbose). |

Policy Table

| Preset | Warning After | Stuck After |
| --- | --- | --- |
| autospec | 10 min | 20 min |
| autocode | 5 min | 12 min |
| autosimplify | 2 min | 6 min |
| autoqa | 6 min | 15 min |
| autofix | 4 min | 10 min |
| (default) | 5 min | 10 min |

Unknown presets fall back to the default policy.
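
The table and its fallback rule could be encoded roughly as follows. This is a minimal sketch, not the contents of src/loops/policy.ts: the names `HealthPolicy`, `PRESET_POLICIES`, `DEFAULT_POLICY`, and `policyFor`, and the choice to store thresholds in milliseconds, are assumptions made for illustration. Only the preset names and minute values come from the table above.

```typescript
// Hypothetical shape for one row of the policy table (names are illustrative).
interface HealthPolicy {
  warningAfterMs: number; // quiet time before a run is "watching"
  stuckAfterMs: number;   // quiet time before a run is "stuck"
}

const MINUTE_MS = 60_000;

// Helper to keep the table readable in minutes.
function minutes(warning: number, stuck: number): HealthPolicy {
  return { warningAfterMs: warning * MINUTE_MS, stuckAfterMs: stuck * MINUTE_MS };
}

// The "(default)" row, used for any preset not listed below.
const DEFAULT_POLICY: HealthPolicy = minutes(5, 10);

// Values copied from the policy table.
const PRESET_POLICIES = new Map<string, HealthPolicy>([
  ["autospec", minutes(10, 20)],
  ["autocode", minutes(5, 12)],
  ["autosimplify", minutes(2, 6)],
  ["autoqa", minutes(6, 15)],
  ["autofix", minutes(4, 10)],
]);

// Unknown presets fall back to the default policy.
function policyFor(preset: string): HealthPolicy {
  return PRESET_POLICIES.get(preset) ?? DEFAULT_POLICY;
}

console.log(policyFor("autosimplify").stuckAfterMs / MINUTE_MS); // 6
console.log(policyFor("my-custom-preset") === DEFAULT_POLICY);   // true
```

A `Map` is used here rather than a plain object so that preset names such as `constructor` cannot accidentally resolve to inherited properties.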

Surfaces

All operator surfaces share the same classification logic from src/loops/health.ts:

  • autoloop loops health — prints a summary with stuck, watching, failed, and active sections.
  • autoloop loops watch <run-id> — prints a one-line advisory when a run transitions into the watching or stuck band.
  • Dashboard /api/runs — returns JSON with active, watching, stuck, recentFailed, and recentCompleted arrays.
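
To make the shared surfaces concrete, here is a sketch of how already-classified runs might be grouped into the five arrays the dashboard returns, along with the transition check that `loops watch` implies. The array names come from the list above; everything else (`ClassifiedRun`, its fields, `summarize`, `shouldAdvise`) is hypothetical, and the real response entries may carry different fields.

```typescript
type Bucket = "active" | "watching" | "stuck" | "failed" | "completed";

// Hypothetical entry shape; the actual /api/runs payload may differ.
interface ClassifiedRun {
  runId: string;
  preset: string;
  bucket: Bucket;
}

// Array names match the JSON keys listed for the dashboard endpoint.
interface HealthSummary {
  active: ClassifiedRun[];
  watching: ClassifiedRun[];
  stuck: ClassifiedRun[];
  recentFailed: ClassifiedRun[];
  recentCompleted: ClassifiedRun[];
}

// Group runs by bucket. Assumes failed/completed runs were already
// filtered to the last 24 hours by the classifier.
function summarize(runs: ClassifiedRun[]): HealthSummary {
  const summary: HealthSummary = {
    active: [], watching: [], stuck: [], recentFailed: [], recentCompleted: [],
  };
  for (const run of runs) {
    switch (run.bucket) {
      case "active": summary.active.push(run); break;
      case "watching": summary.watching.push(run); break;
      case "stuck": summary.stuck.push(run); break;
      case "failed": summary.recentFailed.push(run); break;
      case "completed": summary.recentCompleted.push(run); break;
    }
  }
  return summary;
}

// An advisory is warranted only when a run newly enters the watching or stuck band.
function shouldAdvise(previous: Bucket, next: Bucket): boolean {
  return previous !== next && (next === "watching" || next === "stuck");
}
```

Keeping the grouping and the transition check as pure functions over a bucket value is one way all three surfaces could reuse the same classification without duplicating threshold logic.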

Design Notes

  • Thresholds are intentionally heuristic. They reflect typical iteration cadence per preset and may evolve as usage patterns become clearer.
  • Classification is computed from the updated_at field in the run registry. No additional metadata is required.
  • The policy module (src/loops/policy.ts) is the single source of truth for all thresholds.
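
As an illustration of classification from `updated_at` alone, the sketch below buckets a running loop by comparing its quiet time against a preset's two thresholds. It is not the code in src/loops/health.ts: the function name, the parameter shapes, and the use of strict "greater than" at the boundaries (read from "quiet longer than" in the bucket descriptions) are all assumptions. Failed and completed runs are left out because they depend on run status rather than quiet time.

```typescript
type RunningBucket = "active" | "watching" | "stuck";

// Hypothetical threshold pair, in milliseconds.
interface Thresholds {
  warningAfterMs: number;
  stuckAfterMs: number;
}

// Classify a running loop using only its last-update timestamp.
function classifyRunning(updatedAt: Date, now: Date, policy: Thresholds): RunningBucket {
  const quietMs = now.getTime() - updatedAt.getTime();
  if (quietMs > policy.stuckAfterMs) return "stuck";      // check the larger threshold first
  if (quietMs > policy.warningAfterMs) return "watching";
  return "active";
}

// Worked example using two rows of the policy table.
const autosimplify: Thresholds = { warningAfterMs: 2 * 60_000, stuckAfterMs: 6 * 60_000 };
const autospec: Thresholds = { warningAfterMs: 10 * 60_000, stuckAfterMs: 20 * 60_000 };

const exampleNow = new Date("2024-01-01T12:00:00Z");
const quietThreeMin = new Date("2024-01-01T11:57:00Z");

console.log(classifyRunning(quietThreeMin, exampleNow, autosimplify)); // "watching"
console.log(classifyRunning(quietThreeMin, exampleNow, autospec));     // "active"
```

The same three minutes of silence lands in different buckets depending on the preset, which is the behavior the preset-aware policy is meant to produce.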