Mac Runner Fleet¶
Plan for treating iBuild-E (Mac Mini) and future Mac runners as a managed fleet, with the same lifecycle discipline as Dev-E rig pods.
Status: planning. Phase 1 is in flight via dashecorp/infra#225 (secrets pipeline). This doc captures the full lifecycle plan, of which infra#225 is the first phase. Filed under Stig-Johnny/claude-3#628.
Current state (1 Mac)¶
- 1 host: Mac Mini M4 (`100.92.170.124`, user `claude`)
- Code: manual `git pull` in `~/repos/automate-e/`
- Supervisor: `ai.invotek.ibuild-e.plist` (launchd KeepAlive), edited by hand when env changes
- Secrets: interactive `claude /login` writes to macOS Keychain; expires unannounced
- Config: no source of truth; drifts between operator SSH sessions
- Drift detection: none
- Observability: in-pod heartbeat → conductor, but host-level health invisible
Constraint¶
iOS builds require Xcode → macOS-only. The OS substrate cannot move to Linux containers. But everything above the OS can mirror the rig pod pattern.
Target architecture (parallel to Dev-E rig pods)¶
dashecorp/infra (OpenTofu)
├── secrets/mac-fleet.sops.yaml # Claude OAuth, GitHub App PEM, Apple ID
├── modules/mac-runner/
│ ├── launchd.plist.tpl # rendered per-host
│ ├── deploy-script.tpl # pull image tag, render plist, restart
│ └── outputs.tf # per-host config blobs
└── envs/mac-fleet/main.tf # one block per Mac (ibuild-e, ibuild-e-2, …)
Mac Mini (per host):
~/.rig/ # operator-untouched, OpenTofu-managed
├── current/ # symlink to versioned dir
│ ├── launchd.plist # rendered
│ ├── env # decrypted secrets, mode 0600
│ ├── repos/automate-e/ # git ref pinned to SHA
│ └── manifest.json # OpenTofu state echo: image_tag, secrets_rev, plist_hash
├── releases/<sha>/ # prior versions, kept for rollback
└── bin/
├── rig-deploy # pull-and-apply
└── rig-drift-check # compare /current vs OpenTofu state
Components¶
1. Secrets pipeline (infra#225 scope)¶
- SOPS file in `dashecorp/infra` encrypts: Claude Code OAuth credentials JSON, `ibuild-e-bot` GitHub App PEM, future per-pod tokens
- age recipient = a per-Mac age public key (one per host so cross-host blast radius = 1)
- `rig-deploy` runs `sops decrypt` → writes `~/.rig/current/env` + adds Claude JSON to keychain via `security add-generic-password -U`
- Renewal: long-lived OAuth refresh token kept in SOPS; a renewal script exchanges it, re-encrypts, commits, and runs OpenTofu apply
- Bootstrap: first-time Mac onboarding needs an operator to generate the age key, register the pubkey in `dashecorp/infra`, and run the first `rig-deploy` manually (documented one-time step)
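The decrypt-and-write step above can be sketched as follows. This is a minimal illustration, not the planned `rig-deploy` implementation: the function names and the `KEY=value` env format are assumptions, and only the `sops decrypt` invocation and the 0600 mode come from this plan.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def decrypt_secrets(sops_file: str) -> str:
    """Return decrypted text by shelling out to sops (needs the host's age key)."""
    # `sops decrypt <file>` is the form named in this plan;
    # older sops releases spell it `sops -d <file>`.
    result = subprocess.run(
        ["sops", "decrypt", sops_file],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def write_env_file(secrets: dict[str, str], target: Path) -> None:
    """Write KEY=value lines to `target` atomically with mode 0600."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the temp file with mode 0600, so the secrets are
    # never readable by other users, even briefly.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".env.")
    try:
        with os.fdopen(fd, "w") as handle:
            for key in sorted(secrets):
                handle.write(f"{key}={secrets[key]}\n")
        os.replace(tmp_name, target)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_name)
        raise
```

Writing to a temp file in the same directory and renaming means a crash mid-write leaves the previous `env` intact rather than a truncated one.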
2. Code rev pinning (mirror of rig image-pin chore)¶
- `~/.rig/current/repos/automate-e/` is checked out at an explicit SHA
- New auto-pin chore workflow on `automate-e` cuts a PR in `dashecorp/infra` bumping the `mac_fleet_automate_e_sha` variable on each `automate-e` main push
- Same admin-merge path as today's `chore: pin rig-conductor image to sha-XXXX` PRs
- On PR merge → the next `rig-deploy` run picks it up
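As a rough sketch of what the auto-pin chore would change, the function below rewrites the pin in a variables file. The `name = "value"` file layout and the function name are assumptions for illustration; only the variable name `mac_fleet_automate_e_sha` comes from this plan.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")
PIN_LINE = re.compile(
    r'^([ \t]*mac_fleet_automate_e_sha[ \t]*=[ \t]*)"[^"]*"', re.MULTILINE
)


def bump_pin(tfvars_text: str, new_sha: str) -> str:
    """Return the text with the automate-e pin replaced by `new_sha`."""
    if not FULL_SHA.fullmatch(new_sha):
        raise ValueError("expected a full 40-character lowercase git SHA")
    updated, count = PIN_LINE.subn(
        lambda match: f'{match.group(1)}"{new_sha}"', tfvars_text
    )
    # Refuse to guess if the variable is missing or defined more than once.
    if count != 1:
        raise ValueError(f"expected exactly one pin line, found {count}")
    return updated
```

Rejecting short or malformed SHAs keeps the pin unambiguous, which matters because every host resolves the same value.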
3. Supervisor as code¶
- `launchd.plist.tpl` rendered by OpenTofu with: image SHA, env file path, log paths, KeepAlive
- `rig-deploy` writes new plist atomically → `launchctl bootout gui/$(id -u)/ai.invotek.ibuild-e; launchctl bootstrap gui/$(id -u) ~/.rig/current/launchd.plist`
- Old plist kept in `~/.rig/releases/<prev>/` for rollback
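To make the shape of the rendered plist concrete, here is a sketch that builds an equivalent job with Python's `plistlib`. In the plan the rendering is done by OpenTofu from `launchd.plist.tpl`, so this only illustrates the keys involved. The environment variable names (`RIG_ENV_FILE`, `RIG_IMAGE_SHA`), the log file names, and the function names are hypothetical.

```python
import plistlib

LABEL = "ai.invotek.ibuild-e"


def render_plist(program_args: list[str], image_sha: str,
                 env_file: str, log_dir: str) -> bytes:
    """Build an XML launchd job definition with the fields listed above."""
    job = {
        "Label": LABEL,
        "ProgramArguments": program_args,
        # Hypothetical variable names; the real template may pass these differently.
        "EnvironmentVariables": {
            "RIG_ENV_FILE": env_file,
            "RIG_IMAGE_SHA": image_sha,
        },
        "StandardOutPath": f"{log_dir}/stdout.log",
        "StandardErrorPath": f"{log_dir}/stderr.log",
        "KeepAlive": True,
    }
    return plistlib.dumps(job)


def reload_commands(uid: int, plist_path: str) -> list[list[str]]:
    """Return the bootout/bootstrap pair from the plan as argv lists."""
    return [
        ["launchctl", "bootout", f"gui/{uid}/{LABEL}"],
        ["launchctl", "bootstrap", f"gui/{uid}", plist_path],
    ]
```

A caller would probably need to tolerate a failing `bootout` on the very first deploy, when no job with that label is loaded yet.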
4. Deploy trigger¶
Two options, recommend pull:
- Pull (recommended): a separate launchd LaunchAgent `ai.invotek.rig-deploy` fires every 5 min and runs `rig-deploy`. Self-healing, no inbound network. Same reconcile-loop semantics as Flux.
- Push: GitHub Actions workflow SSHes from a runner to each Mac. Faster but requires inbound SSH access (already available via Tailscale); Pull avoids the ACL plumbing.
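The pull option maps onto launchd's `StartInterval` key. A minimal sketch of that LaunchAgent, assuming `rig-deploy` lives at a path passed in by the caller and that running once at load is wanted (`RunAtLoad` is an assumption, not stated in this plan):

```python
import plistlib


def render_deploy_timer(rig_deploy_path: str, interval_seconds: int = 300) -> bytes:
    """Build a LaunchAgent that runs rig-deploy every `interval_seconds`."""
    job = {
        "Label": "ai.invotek.rig-deploy",
        "ProgramArguments": [rig_deploy_path],
        "StartInterval": interval_seconds,  # 300 s = the 5 min cadence above
        "RunAtLoad": True,  # assumption: reconcile immediately after login
    }
    return plistlib.dumps(job)
```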
5. Drift detection¶
- `rig-drift-check` cron (every 15 min) compares: actual launchd plist hash, actual env hash, actual `automate-e` SHA against `~/.rig/current/manifest.json`. Any diff → POSTs `HOST_DRIFT` event to conductor `/api/events`.
- Conductor surfaces on `/api/hosts` (new endpoint) similar to `/api/agents`.
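The comparison itself is small. The sketch below assumes the manifest carries `plist_hash`, `env_hash`, and `automate_e_sha` fields (only `plist_hash` is named in this plan) and that the event body has `type`, `host`, and `drift` keys; the actual conductor schema is not defined here.

```python
import hashlib
from pathlib import Path

# Hypothetical manifest field names for the three compared values.
CHECKED_FIELDS = ("plist_hash", "env_hash", "automate_e_sha")


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file's bytes, used for the plist and env hashes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(manifest: dict, actual: dict) -> dict:
    """Return {field: {"expected": ..., "actual": ...}} for every mismatch."""
    return {
        key: {"expected": manifest.get(key), "actual": actual.get(key)}
        for key in CHECKED_FIELDS
        if manifest.get(key) != actual.get(key)
    }


def drift_event(host: str, drift: dict) -> dict:
    """Build the body that would be POSTed to the conductor's /api/events."""
    return {"type": "HOST_DRIFT", "host": host, "drift": drift}
```

An empty result from `find_drift` means no event is sent, so a healthy host stays quiet between heartbeats.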
6. Observability¶
- Existing per-pod heartbeats keep working (no change to `rig-agent-runtime` code)
- Add host-level `HOST_HEARTBEAT` event from `rig-deploy` post-apply with manifest hash → conductor dashboard shows fleet rollout progress
- Disk, CPU, memory: shipped via existing `infra-health-mcp` if it already covers the Mac (verify)
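One way to derive the manifest hash for `HOST_HEARTBEAT` is to hash a canonical JSON encoding, so that two hosts with the same manifest content report the same value regardless of key order. The payload field names here are assumptions, not the conductor's schema.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Hex SHA-256 of the manifest in a key-order-independent encoding."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def heartbeat_event(host: str, manifest: dict) -> dict:
    """Build a hypothetical HOST_HEARTBEAT body for the conductor."""
    return {
        "type": "HOST_HEARTBEAT",
        "host": host,
        "manifest_hash": manifest_hash(manifest),
    }
```

With this, rollout progress reduces to counting how many hosts report the hash of the latest manifest.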
7. Fleet scale-out (future N Macs)¶
- New Mac onboarding = one OpenTofu PR adding a host block + one human bootstrap step (generate age key, run first `rig-deploy`)
- Hostname-derived config: e.g. `ibuild-e-2.local` runs same agent persona, distinct `AGENT_ID` (`ibuild-e-2`), competes on same stream. The KEDA-equivalent is handled by stream consumer groups, no changes needed
- Xcode version: pinned per host via OpenTofu var; bootstrap script installs via `xcodes` CLI
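The hostname-derived `AGENT_ID` could be computed as below. Taking the first DNS label and lowercasing it are assumptions; the plan only gives the `ibuild-e-2.local` → `ibuild-e-2` example.

```python
def agent_id_from_hostname(hostname: str) -> str:
    """Derive AGENT_ID from a hostname by taking its first label."""
    # Assumption: everything after the first dot (.local, a tailnet
    # domain, ...) is not part of the agent identity.
    label = hostname.strip().split(".")[0].lower()
    if not label:
        raise ValueError("hostname has no usable first label")
    return label


print(agent_id_from_hostname("ibuild-e-2.local"))
```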
Migration phases¶
| Phase | Deliverable | Depends on | Risk |
|---|---|---|---|
| 1 | infra#225: secrets pipeline only (manual code rev) | none | Low: narrow scope, reversible |
| 2 | launchd plist rendered by OpenTofu, `rig-deploy` writes it | Phase 1 | Med: wrong plist could brick agent; rollback dir mitigates |
| 3 | Code rev auto-pin chore for `automate-e` | Phase 2 | Low: same pattern as rig image-pin |
| 4 | `rig-drift-check` + `HOST_DRIFT` events + `/api/hosts` | Phase 2 | Low: observability only |
| 5 | Second Mac onboarded as fleet test | Phases 1–4 | Med: Apple ID / Xcode procurement |
Open questions / decisions¶
- Apple ID per Mac vs shared? Currently 1 Mac uses 1 Apple ID for Xcode Cloud signing. Multiple Macs sharing 1 Apple ID may hit concurrent-session limits. Decision needed before phase 5.
- First-boot bootstrap automation. Do we want MDM (Apple Business Manager + Mosyle/Kandji) or stay with manual macOS install + bootstrap script? MDM is correct for >3 Macs but expensive for <5.
- Hardware cadence. When does the second Mac happen — driven by queue saturation, or proactive redundancy? Define a metric (e.g., "iBuild-E queue depth > 5 for >1h, sustained 3 days") that triggers procurement.
- Autologin / keychain unlock at boot. Today the Mac Mini auto-logs into the `claude` user and the keychain is unlocked. Document that this is intentional (security trade-off: agents can use the keychain without an operator) and record it in the security model.
- Xcode upgrade strategy. When a new Xcode lands, do all Macs get bumped together (fleet roll), or canary first? OpenTofu var per host enables either.
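The example procurement metric under "Hardware cadence" can be made testable. The sketch below uses one possible reading: a day counts as a breach if queue depth stays above 5 for more than 1 hour without interruption, and procurement triggers after 3 consecutive breach days. The sample format (time-ordered `(timestamp, depth)` pairs) and both function names are assumptions.

```python
from datetime import date, datetime, timedelta


def breach_days(samples: list[tuple[datetime, int]], threshold: int = 5,
                min_duration: timedelta = timedelta(hours=1)) -> set[date]:
    """Dates with an unbroken run above `threshold` longer than `min_duration`."""
    days: set[date] = set()
    run_start = None
    for stamp, depth in samples:  # samples must be in time order
        if depth > threshold:
            if run_start is None:
                run_start = stamp
            if stamp - run_start > min_duration:
                # A run that crosses midnight is attributed to its start date.
                days.add(run_start.date())
        else:
            run_start = None
    return days


def should_procure(samples: list[tuple[datetime, int]],
                   sustained_days: int = 3) -> bool:
    """True if breaches occurred on `sustained_days` consecutive dates."""
    days = breach_days(samples)
    return any(
        all(day + timedelta(days=offset) in days for offset in range(sustained_days))
        for day in days
    )
```

Whether "sustained 3 days" means consecutive days or any 3 days in a window is one of the details the decision would need to pin down.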
What's NOT in scope¶
- Linux runner fleet (different problem, Dev-E pods already cover it)
- iOS device test farm (separate later epic)
- Cross-platform unified persona — iBuild-E stays iOS-specialist, doesn't claim non-iOS work
Related¶
- dashecorp/infra#225: Phase 1, SOPS secrets pipeline (epic, in flight)
- `pattern_codex_oauth_bootstrap.md` (claude memory): analogous OAuth bootstrap procedure for rig pods
- `docs/infrastructure/agents.md`: agent registry (iBuild-E entry)
- `docs/infrastructure/credentials.md`: credentials registry, updated in phase 1