Mac Runner Fleet

Plan for treating iBuild-E (Mac Mini) and future Mac runners as a managed fleet, with the same lifecycle discipline as Dev-E rig pods.

Status: Phase 1 shipped via dashecorp/infra#225 (SOPS secrets pipeline). This doc captures the full lifecycle plan (phases 2-5) tracked in dashecorp/infra#229. Filed under Stig-Johnny/claude-3#628.

Current state (1 Mac)

1 host: Mac Mini M4 (100.92.170.124, user claude)
Code: manual git pull in ~/repos/claude-3/ (agent-runner runtime lives in ~/repos/claude-3/agent-runner)
Supervisor: ai.invotek.ibuild-e.plist (launchd KeepAlive), edited by hand when env changes
Secrets: an interactive claude /login writes credentials to the macOS Keychain; they expire without warning
Config: no source of truth; it drifts between operator SSH sessions
Drift detection: none
Observability: in-pod heartbeats reach the conductor, but host-level health is invisible

Constraint

iOS builds require Xcode, which runs only on macOS, so the OS substrate cannot move to Linux containers. Everything above the OS, however, can mirror the rig pod pattern.

Target architecture (parallel to Dev-E rig pods)

dashecorp/infra (OpenTofu)
  ├── secrets/mac-fleet.sops.yaml          # Claude OAuth, GitHub App PEM, Apple ID
  ├── modules/mac-runner/
  │     ├── launchd.plist.tpl              # rendered per-host
  │     ├── deploy-script.tpl              # pull image tag, render plist, restart
  │     └── outputs.tf                     # per-host config blobs
  └── envs/mac-fleet/main.tf               # one block per Mac (ibuild-e, ibuild-e-2, …)

Mac Mini (per host):
  ~/.rig/                                  # operator-untouched, OpenTofu-managed
    ├── current/                           # symlink to versioned dir
    │   ├── launchd.plist                  # rendered
    │   ├── env                            # decrypted secrets, mode 0600
    │   ├── repos/automate-e/              # git ref pinned to SHA
    │   └── manifest.json                  # OpenTofu state echo: image_tag, secrets_rev, plist_hash
    ├── releases/<sha>/                    # prior versions, kept for rollback
    └── bin/
        ├── rig-deploy                     # pull-and-apply
        └── rig-drift-check                # compare live host state vs current/manifest.json
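
The current/ symlink and releases/<sha>/ layout above suggests an atomic switch between versions. A minimal sketch of that switch, assuming a release directory has already been staged; the function name activate_release and the temporary link name are illustrative, not part of the plan:

```python
import os
from pathlib import Path


def activate_release(rig_root: Path, sha: str) -> Path:
    """Point <rig_root>/current at <rig_root>/releases/<sha> atomically."""
    release_dir = rig_root / "releases" / sha
    if not release_dir.is_dir():
        raise FileNotFoundError(f"release {sha} has not been staged")

    current = rig_root / "current"
    # Create the new symlink under a temporary name, then rename it over
    # the old one. A rename within one filesystem is atomic, so readers
    # see either the old target or the new one, never a missing link.
    tmp_link = rig_root / f".current.{os.getpid()}.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    # Relative target, resolved from rig_root where the link lives.
    os.symlink(os.path.join("releases", sha), tmp_link)
    os.replace(tmp_link, current)
    return current.resolve()
```

Under this sketch a rollback is the same call with the previous SHA, which is why prior release directories are worth keeping.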

Components

1. Secrets pipeline (infra#225 scope)

SOPS file in dashecorp/infra encrypts: Claude Code OAuth credentials JSON, ibuild-e-bot GitHub App PEM, future per-pod tokens
age recipient = a per-Mac age public key (one per host so cross-host blast radius = 1)
rig-deploy runs sops decrypt → writes ~/.rig/current/env + adds Claude JSON to keychain via security add-generic-password -U
Renewal: a long-lived OAuth refresh token is stored in SOPS; a renewal script exchanges it for fresh credentials, re-encrypts the file, commits it, and triggers an OpenTofu apply
Bootstrap: first-time Mac onboarding needs operator to generate age key, register pubkey in dashecorp/infra, run first rig-deploy manually (documented one-time step)
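
The "writes ~/.rig/current/env" step with mode 0600 could look roughly like the sketch below. It assumes decryption has already happened and the secrets arrive as a flat dictionary; write_env_file and the KEY=value layout are assumptions for illustration, not the shipped rig-deploy behavior:

```python
import os
from pathlib import Path


def write_env_file(path: Path, secrets: dict[str, str]) -> None:
    """Write KEY=value lines to `path` with mode 0600, replacing atomically."""
    lines = []
    for key, value in sorted(secrets.items()):
        if "\n" in value:
            # PEM blobs and similar would need quoting or a separate file.
            raise ValueError(f"{key}: multi-line values need a different format")
        lines.append(f"{key}={value}\n")

    tmp = path.with_name(path.name + ".tmp")
    if tmp.exists():
        tmp.unlink()
    # O_EXCL plus an explicit mode means the file is created private,
    # rather than created world-readable and tightened afterwards.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        os.fchmod(handle.fileno(), 0o600)  # exact mode regardless of umask
        handle.writelines(lines)
    os.replace(tmp, path)  # readers never observe a half-written file
```

Writing to a temporary name and renaming keeps a crash mid-write from leaving the agent with a truncated env file.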

2. Code rev pinning (mirror of rig image-pin chore)

~/.rig/current/repos/automate-e/ is checked out at an explicit SHA
New auto-pin chore workflow on automate-e cuts a PR in dashecorp/infra bumping the mac_fleet_automate_e_sha variable on each automate-e main push
Same admin-merge path as today's chore: pin rig-conductor image to sha-XXXX PRs
After the PR merges, the next rig-deploy run picks up the new SHA
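
The edit that the auto-pin chore makes before opening its PR can be sketched as a single-variable rewrite. This assumes the pin lives in a tfvars-style file as a quoted string assignment; that file layout and the helper name bump_pin are assumptions, only the variable name comes from the plan:

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{40}$")
PIN_RE = re.compile(
    r'^(\s*mac_fleet_automate_e_sha\s*=\s*)"[0-9a-f]*"',
    re.MULTILINE,
)


def bump_pin(tfvars_text: str, new_sha: str) -> str:
    """Return tfvars_text with mac_fleet_automate_e_sha set to new_sha."""
    if not SHA_RE.match(new_sha):
        raise ValueError("expected a full 40-character commit SHA")
    updated, count = PIN_RE.subn(
        lambda m: f'{m.group(1)}"{new_sha}"', tfvars_text
    )
    # Refuse to guess if the variable is missing or defined twice.
    if count != 1:
        raise ValueError(f"expected exactly one pin assignment, found {count}")
    return updated
```

Requiring a full SHA rather than a branch name or short hash keeps the checkout on each Mac reproducible.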

3. Supervisor as code

launchd.plist.tpl rendered by OpenTofu with: image SHA, env file path, log paths, KeepAlive
rig-deploy writes new plist atomically → launchctl bootout gui/$(id -u)/ai.invotek.ibuild-e; launchctl bootstrap gui/$(id -u) ~/.rig/current/launchd.plist
Old plist kept in ~/.rig/releases/<prev>/ for rollback
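
To make the shape of the rendered plist concrete, here is a sketch using Python's plistlib. The plan renders the template with OpenTofu, so this only illustrates the output; the RIG_ENV_FILE variable, the log file naming, and the function name are hypothetical:

```python
import plistlib


def render_agent_plist(
    label: str, program: list[str], env_file: str, log_dir: str
) -> bytes:
    """Build a launchd job definition as XML plist bytes."""
    job = {
        "Label": label,
        "ProgramArguments": program,
        "KeepAlive": True,
        # launchd does not load env files itself, so this sketch hands the
        # path to the program and lets a wrapper read it (an assumption).
        "EnvironmentVariables": {"RIG_ENV_FILE": env_file},
        "StandardOutPath": f"{log_dir}/{label}.out.log",
        "StandardErrorPath": f"{log_dir}/{label}.err.log",
    }
    return plistlib.dumps(job, fmt=plistlib.FMT_XML)
```

Hashing these bytes would give a plist_hash value of the kind the manifest records.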

4. Deploy trigger

Two options; pull is recommended:

Pull (recommended): a separate launchd LaunchAgent, ai.invotek.rig-deploy, fires every 5 min and runs rig-deploy. Self-healing, with no inbound network access required. This mirrors the semantics of Flux's reconcile loop.
Push: a GitHub Actions workflow SSHes from a runner to each Mac. Faster, but requires inbound SSH access. That access already exists via Tailscale, but pull avoids the ACL plumbing.
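
A pull loop that fires every 5 minutes is only cheap if most runs are no-ops. One way to sketch the per-run decision is to compare the desired revision against what the manifest says was last applied. The field names come from manifest.json in the target architecture; the Revision type and plan_deploy are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

FIELDS = ("image_tag", "secrets_rev", "plist_hash")


@dataclass(frozen=True)
class Revision:
    image_tag: str
    secrets_rev: str
    plist_hash: str


def plan_deploy(desired: Revision, applied: Optional[Revision]) -> list[str]:
    """Return the manifest fields that must change; empty means no-op."""
    if applied is None:
        return list(FIELDS)  # first deploy: everything is new
    return [
        name
        for name in FIELDS
        if getattr(desired, name) != getattr(applied, name)
    ]
```

An empty result lets the run exit without touching launchd, so the agent is restarted only when something actually changed.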

5. Drift detection

rig-drift-check cron (every 15 min) compares: actual launchd plist hash, actual env hash, actual automate-e SHA against ~/.rig/current/manifest.json. Any diff → POSTs HOST_DRIFT event to conductor /api/events.
Conductor surfaces on /api/hosts (new endpoint) similar to /api/agents.
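
The comparison step can be sketched as hashing what is on disk and diffing it against the recorded values. The event payload shape and the manifest key names in the usage note (env_hash, automate_e_sha) are assumptions, since the plan does not define the conductor's schema:

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(
    expected: dict[str, str], actual: dict[str, str]
) -> dict[str, dict[str, Optional[str]]]:
    """Return {field: {"expected": ..., "actual": ...}} for every mismatch."""
    drift: dict[str, dict[str, Optional[str]]] = {}
    for field, want in expected.items():
        got = actual.get(field)  # a missing value counts as drift
        if got != want:
            drift[field] = {"expected": want, "actual": got}
    return drift


def drift_event(host: str, drift: dict) -> Optional[str]:
    """Build a HOST_DRIFT JSON body, or None when nothing drifted."""
    if not drift:
        return None
    return json.dumps(
        {"type": "HOST_DRIFT", "host": host, "drift": drift}, sort_keys=True
    )
```

A caller would fill `actual` with values such as `{"plist_hash": sha256_file(plist_path), "env_hash": sha256_file(env_path)}` and POST the returned body only when it is not None.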

6. Observability

Existing per-pod heartbeats keep working (no change to rig-agent-runtime code)
Add host-level HOST_HEARTBEAT event from rig-deploy post-apply with manifest hash → conductor dashboard shows fleet rollout progress
Disk, CPU, memory: shipped via the existing infra-health-mcp, if it already covers the Mac (to be verified)
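
For the manifest hash in HOST_HEARTBEAT to be comparable across hosts, it helps to hash a canonical serialization rather than the file bytes. A small sketch under that assumption; the payload fields other than the event type are hypothetical:

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """Hash a manifest so key order and whitespace do not change the digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def heartbeat_event(host: str, manifest: dict) -> dict:
    """Build a HOST_HEARTBEAT payload carrying the manifest digest."""
    return {
        "type": "HOST_HEARTBEAT",
        "host": host,
        "manifest_hash": manifest_hash(manifest),
    }
```

With this approach, two hosts report the same digest exactly when they run the same revision, which is what a rollout-progress view needs.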

7. Fleet scale-out (future N Macs)

New Mac onboarding = one OpenTofu PR adding a host block + one human bootstrap step (generate age key, run first rig-deploy)
Hostname-derived config: e.g. ibuild-e-2.local runs same agent persona, distinct AGENT_ID (ibuild-e-2), competes on same stream — KEDA-equivalent is handled by stream consumer groups, no changes needed
Xcode version: pinned per host via OpenTofu var; bootstrap script installs via xcodes CLI
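
Hostname-derived config can be as small as one function. The sketch below assumes hosts follow the ibuild-e, ibuild-e-2, ... naming shown in the plan and rejects anything else; the function name and the strictness are design choices for illustration:

```python
import re

# Accepts "ibuild-e" or "ibuild-e-<n>", with an optional ".local" suffix.
_HOST_RE = re.compile(r"^(ibuild-e(?:-[0-9]+)?)(?:\.local)?$")


def agent_id_from_hostname(hostname: str) -> str:
    """Derive AGENT_ID from a fleet hostname such as ibuild-e-2.local."""
    match = _HOST_RE.match(hostname.strip().lower())
    if match is None:
        raise ValueError(f"{hostname!r} does not look like a fleet host")
    return match.group(1)
```

Failing loudly on an unexpected hostname avoids two Macs silently sharing one AGENT_ID on the same stream.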

Migration phases

| Phase | Deliverable | Depends on | Risk |
|---|---|---|---|
| 1 | infra#225: secrets pipeline only (manual code rev) | none | Low: narrow scope, reversible |
| 2 | launchd plist rendered by OpenTofu, `rig-deploy` writes it | Phase 1 | Med: wrong plist could brick agent; rollback dir mitigates |
| 3 | Code rev auto-pin chore for `automate-e` | Phase 2 | Low: same pattern as rig image-pin |
| 4 | `rig-drift-check` + `HOST_DRIFT` events + `/api/hosts` | Phase 2 | Low: observability only |
| 5 | Second Mac onboarded as fleet test | Phases 1–4 | Med: Apple ID / Xcode procurement |

Open questions / decisions

Apple ID per Mac vs shared? Currently 1 Mac uses 1 Apple ID for App Store Connect / fastlane match signing (GitHub Actions build-and-upload.yml; Xcode Cloud retired). Multiple Macs sharing 1 Apple ID may hit concurrent-session limits. Decision needed before phase 5.
First-boot bootstrap automation. Do we want MDM (Apple Business Manager + Mosyle/Kandji) or stay with manual macOS install + bootstrap script? MDM is correct for >3 Macs but expensive for <5.
Hardware cadence. When does the second Mac happen — driven by queue saturation, or proactive redundancy? Define a metric (e.g., "iBuild-E queue depth > 5 for >1h, sustained 3 days") that triggers procurement.
Autologin / keychain unlock at boot. Today the Mac Mini auto-logs into the claude user and the keychain is unlocked. Document that this is intentional (security trade-off: agents can use the keychain without an operator) and add it to the security model.
Xcode upgrade strategy. When a new Xcode lands, do all Macs get bumped together (fleet roll), or canary first? OpenTofu var per host enables either.
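
The example procurement trigger under "Hardware cadence" can be made testable. The sketch below reads "queue depth > 5 for >1h, sustained 3 days" as: on each of three consecutive days there is a continuous run longer than one hour with depth above 5. That reading, the sample format, and both function names are assumptions:

```python
from datetime import date, datetime, timedelta
from typing import Iterable


def saturated_days(
    samples: Iterable[tuple[datetime, int]],
    depth_limit: int = 5,
    min_run: timedelta = timedelta(hours=1),
) -> set[date]:
    """Days with a continuous run longer than min_run above depth_limit.

    `samples` must be (timestamp, depth) pairs sorted by time. A run that
    crosses midnight is attributed to the day it started.
    """
    days: set[date] = set()
    run_start = None
    for ts, depth in samples:
        if depth > depth_limit:
            if run_start is None:
                run_start = ts
            if ts - run_start > min_run:
                days.add(run_start.date())
        else:
            run_start = None  # any dip at or below the limit ends the run
    return days


def should_procure(samples, required_days: int = 3) -> bool:
    """True when saturation occurs on required_days consecutive days."""
    streak = 0
    previous = None
    for day in sorted(saturated_days(samples)):
        consecutive = previous is not None and day - previous == timedelta(days=1)
        streak = streak + 1 if consecutive else 1
        if streak >= required_days:
            return True
        previous = day
    return False
```

Whatever definition is chosen, encoding it this way lets the threshold be replayed against historical queue data before committing to a purchase.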

What's NOT in scope

Linux runner fleet (different problem, Dev-E pods already cover it)
iOS device test farm (separate later epic)
Cross-platform unified persona — iBuild-E stays iOS-specialist, doesn't claim non-iOS work

Related

dashecorp/infra#225: Phase 1, SOPS secrets pipeline (epic, completed)
pattern_codex_oauth_bootstrap.md (claude memory): analogous OAuth bootstrap procedure for rig pods
docs/infrastructure/agents.md: agent registry (iBuild-E entry)
docs/infrastructure/credentials.md: credentials registry, updated in phase 1