Monitor-E

Phase 3

Monitor-E is planned for Phase 3 of the Rig. This document describes the target design.

Monitor-E watches production around the clock. When something breaks, Monitor-E detects the failure and creates a GitHub issue; Conductor-E then assigns that issue to Dev-E for a fix-forward response.

Responsibilities

  • Watch production health endpoints
  • Monitor error rates, latency, and anomalies
  • Detect incidents and classify severity
  • Create GitHub issues with diagnostic context
  • Emit events to the Event Store

Events Emitted

| Event | When | Data |
| --- | --- | --- |
| INCIDENT_DETECTED | Health check fails or anomaly detected | service, severity, details |
| INCIDENT_RESOLVED | Health recovered | service, duration |
| SMOKE_PASSED | Post-deploy smoke test passes | environment, checks |
| SMOKE_FAILED | Post-deploy smoke test fails | environment, reason, details |
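
To make the event shapes concrete, here is a minimal Python sketch that validates an event's data fields before it is written to the Event Store. The event names and field names come from the table above; the envelope layout (`type` plus `data`) and the helper name `build_event` are assumptions made for illustration, not part of the Monitor-E design.

```python
from enum import Enum


class EventType(str, Enum):
    INCIDENT_DETECTED = "INCIDENT_DETECTED"
    INCIDENT_RESOLVED = "INCIDENT_RESOLVED"
    SMOKE_PASSED = "SMOKE_PASSED"
    SMOKE_FAILED = "SMOKE_FAILED"


# Data fields listed for each event in the "Events Emitted" table.
REQUIRED_FIELDS = {
    EventType.INCIDENT_DETECTED: {"service", "severity", "details"},
    EventType.INCIDENT_RESOLVED: {"service", "duration"},
    EventType.SMOKE_PASSED: {"environment", "checks"},
    EventType.SMOKE_FAILED: {"environment", "reason", "details"},
}


def build_event(event_type: EventType, data: dict) -> dict:
    """Return an event envelope, rejecting payloads with missing fields.

    The {"type": ..., "data": ...} envelope is a hypothetical layout.
    """
    missing = REQUIRED_FIELDS[event_type] - set(data)
    if missing:
        raise ValueError(
            f"{event_type.value} is missing fields: {sorted(missing)}"
        )
    return {"type": event_type.value, "data": dict(data)}
```

For example, `build_event(EventType.SMOKE_FAILED, {"environment": "production", "reason": "code", "details": "login check failed"})` would produce an envelope ready to emit, while omitting `reason` would raise a `ValueError`.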

Smoke Test Classification

When a smoke test fails after deployment:

| Reason | Action |
| --- | --- |
| code | Agent can fix; retry up to 3 times |
| external_dependency | Not our fault (Stripe down, Claude API outage); escalate to CTO immediately, no retry |
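
The classification above could be expressed as a small decision function. This is a sketch only: the action strings and the `retries_exhausted` outcome are hypothetical, since this document does not say what happens once the three retries for a `code` failure are used up.

```python
MAX_CODE_RETRIES = 3  # "retry up to 3 times" for reason == "code"


def next_action(reason: str, retries_so_far: int) -> str:
    """Decide what to do after a SMOKE_FAILED event.

    Return values are illustrative labels, not identifiers from the Rig.
    """
    if reason == "external_dependency":
        # Not fixable by an agent: escalate immediately, never retry.
        return "escalate_to_cto"
    if reason == "code":
        if retries_so_far < MAX_CODE_RETRIES:
            return "retry"
        # Assumption: the follow-up after exhausted retries is undefined here.
        return "retries_exhausted"
    raise ValueError(f"unknown smoke failure reason: {reason!r}")
```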

What It Monitors

| Check | Frequency | Threshold |
| --- | --- | --- |
| Health endpoints (/health) | Every 60s | 3 consecutive failures |
| Error rate (5xx) | Every 5 min | > 1% of requests |
| Response latency (p95) | Every 5 min | > 2x baseline |
| Certificate expiry | Daily | < 30 days |
| Disk usage | Every 15 min | > 85% |
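
The thresholds above translate directly into simple predicates. The sketch below assumes the metric values have already been collected (for example from Prometheus or a direct probe); all function and class names are hypothetical, and "< 30 days" is read here as days remaining until the certificate expires.

```python
from dataclasses import dataclass

HEALTH_FAILURE_LIMIT = 3   # consecutive /health failures
ERROR_RATE_LIMIT = 0.01    # more than 1% of requests returning 5xx
LATENCY_FACTOR = 2.0       # p95 above 2x baseline
CERT_MIN_DAYS = 30         # fewer than 30 days until expiry (assumed reading)
DISK_USAGE_LIMIT = 0.85    # more than 85% used


@dataclass
class HealthTracker:
    """Counts consecutive failed health probes for one service."""

    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe; return True once the failure limit is reached."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= HEALTH_FAILURE_LIMIT


def error_rate_breached(errors_5xx: int, total_requests: int) -> bool:
    if total_requests == 0:
        return False  # no traffic, nothing to judge
    return errors_5xx / total_requests > ERROR_RATE_LIMIT


def latency_breached(p95_ms: float, baseline_p95_ms: float) -> bool:
    return p95_ms > LATENCY_FACTOR * baseline_p95_ms


def certificate_breached(days_until_expiry: int) -> bool:
    return days_until_expiry < CERT_MIN_DAYS


def disk_breached(used_fraction: float) -> bool:
    return used_fraction > DISK_USAGE_LIMIT
```

Requiring three consecutive failures before flagging a health endpoint keeps a single dropped probe from opening an incident; a successful probe resets the counter.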

Integration

  • Reads from: Grafana/Prometheus (if available), direct health checks
  • Writes to: Event Store, GitHub (create issues), Discord (#admin for critical)
  • Conductor-E reads INCIDENT_DETECTED events and auto-assigns to Dev-E with critical priority
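
As an illustration of the GitHub write path, the sketch below turns an INCIDENT_DETECTED event into the JSON body for GitHub's "create an issue" REST endpoint, which accepts `title`, `body`, and `labels`. The title format, the reuse of the severity string as a label, and the `{"type": ..., "data": ...}` event envelope are assumptions; the actual request (authentication, repository, HTTP client) is deliberately left out.

```python
def build_issue_payload(event: dict) -> dict:
    """Build a GitHub issue payload from an INCIDENT_DETECTED event.

    Assumes a hypothetical envelope: {"type": ..., "data": {...}}.
    """
    if event.get("type") != "INCIDENT_DETECTED":
        raise ValueError("only INCIDENT_DETECTED events become issues")
    data = event["data"]
    body = (
        f"Service: {data['service']}\n"
        f"Severity: {data['severity']}\n\n"
        f"Details:\n{data['details']}"
    )
    return {
        "title": f"[{data['severity']}] Incident detected in {data['service']}",
        "body": body,
        # Assumption: the severity doubles as the issue label, e.g. "critical".
        "labels": [data["severity"]],
    }
```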

Fix-Forward Philosophy

"Fix forward, not rollback." When production breaks:

  1. Monitor-E detects → creates issue with critical label
  2. Conductor-E assigns to Dev-E immediately (critical = top priority)
  3. Dev-E attempts fix-forward
  4. If 2 attempts fail → escalate to CTO
  5. Rollback is never automatic — always a CTO decision
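
The escalation rules in the steps above can be sketched as two small functions. The names and return labels are hypothetical; the only values taken from this document are the limit of 2 failed attempts and the rule that rollback requires a CTO decision.

```python
MAX_FIX_ATTEMPTS = 2  # "If 2 attempts fail, escalate to CTO"


def next_step(failed_attempts: int) -> str:
    """Return who acts next on a critical incident."""
    if failed_attempts < 0:
        raise ValueError("failed_attempts cannot be negative")
    if failed_attempts >= MAX_FIX_ATTEMPTS:
        return "escalate_to_cto"
    return "dev_e_fix_forward"


def rollback_allowed(approved_by_cto: bool) -> bool:
    """Rollback is never automatic: it proceeds only with CTO approval."""
    return approved_by_cto
```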