Monitor-E
Phase 3
Monitor-E is planned for Phase 3 of the Rig. This document describes the target design.
Monitor-E watches production 24/7. When something breaks, Monitor-E detects the incident and creates a GitHub issue; Conductor-E then assigns that issue to Dev-E for a fix-forward response.
Responsibilities
- Watch production health endpoints
- Monitor error rates, latency, and anomalies
- Detect incidents and classify severity
- Create GitHub issues with diagnostic context
- Emit events to the Event Store
Events Emitted
| Event | When | Data |
|---|---|---|
| `INCIDENT_DETECTED` | Health check fails or anomaly detected | service, severity, details |
| `INCIDENT_RESOLVED` | Health recovered | service, duration |
| `SMOKE_PASSED` | Post-deploy smoke test passes | environment, checks |
| `SMOKE_FAILED` | Post-deploy smoke test fails | environment, reason, details |
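
The event payloads in the table above could be modelled roughly as follows. This is a minimal sketch, not the Event Store's actual schema: the `Event` class, the `timestamp` field, and the `make_event` and `to_json` helpers are hypothetical names introduced for illustration. Only the event names and their data fields come from the table.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Data fields each event must carry, taken from the table above.
REQUIRED_FIELDS = {
    "INCIDENT_DETECTED": {"service", "severity", "details"},
    "INCIDENT_RESOLVED": {"service", "duration"},
    "SMOKE_PASSED": {"environment", "checks"},
    "SMOKE_FAILED": {"environment", "reason", "details"},
}


@dataclass
class Event:
    """Hypothetical event envelope; the timestamp field is an assumption."""
    type: str
    data: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def make_event(event_type: str, **data) -> Event:
    """Build an event, rejecting unknown types and missing data fields."""
    if event_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown event type: {event_type}")
    missing = REQUIRED_FIELDS[event_type] - set(data)
    if missing:
        raise ValueError(f"{event_type} is missing fields: {sorted(missing)}")
    return Event(type=event_type, data=data)


def to_json(event: Event) -> str:
    """Serialize an event for whatever transport the Event Store uses."""
    return json.dumps(asdict(event))
```

For example, `make_event("INCIDENT_DETECTED", service="api", severity="critical", details="3 failed health checks")` yields an event ready for serialization, while omitting `severity` raises a `ValueError`.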
Smoke Test Classification
When a smoke test fails after deployment:
| Reason | Action |
|---|---|
| `code` | Agent can fix: retry up to 3 times |
| `external_dependency` | Not our fault (Stripe down, Claude API outage): escalate to CTO immediately, no retry |
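
The classification above maps to a small decision function. The sketch below assumes hypothetical names (`FailureReason`, `next_action`, and the returned action strings). The target design does not say what happens once the 3 retries are used up, so the sketch only reports `retries_exhausted` and leaves the follow-up open.

```python
from enum import Enum


class FailureReason(Enum):
    """The two smoke-failure reasons from the table above."""
    CODE = "code"
    EXTERNAL_DEPENDENCY = "external_dependency"


MAX_CODE_RETRIES = 3  # "retry up to 3 times"


def next_action(reason: FailureReason, retries_so_far: int) -> str:
    """Decide what to do after a failed post-deploy smoke test."""
    if reason is FailureReason.EXTERNAL_DEPENDENCY:
        # Not fixable by the agent: escalate at once, never retry.
        return "escalate_to_cto"
    if retries_so_far < MAX_CODE_RETRIES:
        return "retry"
    # Assumption: the source does not specify the step after 3 retries.
    return "retries_exhausted"
```

Keeping the decision in a pure function makes the retry limit easy to test without running a real deployment.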
What It Monitors
| Check | Frequency | Threshold |
|---|---|---|
| Health endpoints (`/health`) | Every 60s | 3 consecutive failures |
| Error rate (5xx) | Every 5 min | > 1% of requests |
| Response latency (p95) | Every 5 min | > 2x baseline |
| Certificate expiry | Daily | < 30 days |
| Disk usage | Every 15 min | > 85% |
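
The thresholds in the table above could be evaluated along these lines. This is an illustrative sketch only: every function and class name is hypothetical, and how the measurements are collected (Prometheus queries, direct probes) is left out.

```python
from datetime import datetime, timedelta


class HealthTracker:
    """Counts consecutive health-check failures for one service."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True once the threshold is hit."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold


def error_rate_breached(errors_5xx: int, total_requests: int) -> bool:
    """True when 5xx responses exceed 1% of requests in the window."""
    if total_requests == 0:
        return False  # assumption: no traffic means no breach
    return errors_5xx / total_requests > 0.01


def latency_breached(p95_ms: float, baseline_ms: float) -> bool:
    """True when p95 latency is more than twice the baseline."""
    return p95_ms > 2 * baseline_ms


def cert_expiring(not_after: datetime, now: datetime) -> bool:
    """True when the certificate expires in fewer than 30 days."""
    return not_after - now < timedelta(days=30)


def disk_breached(used_fraction: float) -> bool:
    """True when disk usage is above 85% (expressed as 0.0 to 1.0)."""
    return used_fraction > 0.85
```

All comparisons are strict, matching the `>` and `<` thresholds in the table, so a value sitting exactly on a threshold does not trigger.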
Integration
- Reads from: Grafana/Prometheus (if available), direct health checks
- Writes to: Event Store, GitHub (create issues), Discord (#admin for critical)
- Conductor-E reads `INCIDENT_DETECTED` events and auto-assigns to Dev-E with `critical` priority
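
The routing described above might look like the following sketch. The sink identifiers, the event dictionary shape, and both function names are assumptions made for illustration; they are not an existing Conductor-E or Event Store API.

```python
from typing import List, Optional


def sinks_for(severity: str) -> List[str]:
    """Where Monitor-E writes an incident; Discord only for critical ones."""
    sinks = ["event_store", "github_issue"]
    if severity == "critical":
        sinks.append("discord:#admin")
    return sinks


def conductor_assign(event: dict) -> Optional[dict]:
    """Conductor-E side: turn an INCIDENT_DETECTED event into an assignment."""
    if event.get("type") != "INCIDENT_DETECTED":
        return None  # other event types are not assigned here
    return {
        "assignee": "Dev-E",
        "priority": "critical",
        "service": event["data"]["service"],
    }
```

For instance, `sinks_for("critical")` includes the Discord channel, while any other severity value is written only to the Event Store and GitHub.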
Fix-Forward Philosophy
"Fix forward, not rollback." When production breaks:
- Monitor-E detects → creates issue with `critical` label
- Conductor-E assigns to Dev-E immediately (critical = top priority)
- Dev-E attempts fix-forward
- If 2 attempts fail → escalate to CTO
- Rollback is never automatic — always a CTO decision
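
The escalation steps above can be sketched as a small state tracker. The `Incident` class, its field names, and the returned action strings are hypothetical. The one rule taken directly from the list is that two failed attempts trigger escalation and that rollback requires an explicit CTO decision.

```python
from dataclasses import dataclass

MAX_FIX_ATTEMPTS = 2  # "If 2 attempts fail -> escalate to CTO"


@dataclass
class Incident:
    """Hypothetical per-incident state for the fix-forward loop."""
    service: str
    failed_attempts: int = 0
    escalated: bool = False
    rollback_approved_by_cto: bool = False


def record_failed_fix(incident: Incident) -> str:
    """Register a failed fix-forward attempt and return the next step."""
    incident.failed_attempts += 1
    if incident.failed_attempts >= MAX_FIX_ATTEMPTS:
        incident.escalated = True
        return "escalate_to_cto"
    return "retry_fix_forward"


def may_rollback(incident: Incident) -> bool:
    """Rollback is never automatic: allowed only after CTO approval."""
    return incident.rollback_approved_by_cto
```

Note that escalation alone does not permit a rollback in this sketch; `may_rollback` stays false until the approval flag is set by a human decision.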