Mac Runner Fleet

Plan for treating iBuild-E (Mac Mini) and future Mac runners as a managed fleet, with the same lifecycle discipline as Dev-E rig pods.

Status: planning. Phase 1 is in flight via dashecorp/infra#225 (secrets pipeline). This doc captures the full lifecycle plan, of which infra#225 is the first phase. Filed under Stig-Johnny/claude-3#628.

Current state (1 Mac)

  • 1 host: Mac Mini M4 (100.92.170.124, user claude)
  • Code: manual git pull in ~/repos/automate-e/
  • Supervisor: ai.invotek.ibuild-e.plist (launchd KeepAlive), edited by hand when env changes
  • Secrets: interactive claude /login writes to the macOS Keychain; the credential expires without warning
  • Config: no source of truth — drifts between operator SSH sessions
  • Drift detection: none
  • Observability: in-pod heartbeat → conductor, but host-level health is invisible

Constraint

iOS builds require Xcode → macOS-only. The OS substrate cannot move to Linux containers. But everything above the OS can mirror the rig pod pattern.

Target architecture (parallel to Dev-E rig pods)

dashecorp/infra (OpenTofu)
  ├── secrets/mac-fleet.sops.yaml          # Claude OAuth, GitHub App PEM, Apple ID
  ├── modules/mac-runner/
  │     ├── launchd.plist.tpl              # rendered per-host
  │     ├── deploy-script.tpl              # pull image tag, render plist, restart
  │     └── outputs.tf                     # per-host config blobs
  └── envs/mac-fleet/main.tf               # one block per Mac (ibuild-e, ibuild-e-2, …)

Mac Mini (per host):
  ~/.rig/                                  # operator-untouched, OpenTofu-managed
    ├── current/                           # symlink to versioned dir
    │   ├── launchd.plist                  # rendered
    │   ├── env                            # decrypted secrets, mode 0600
    │   ├── repos/automate-e/              # git ref pinned to SHA
    │   └── manifest.json                  # OpenTofu state echo: image_tag, secrets_rev, plist_hash
    ├── releases/<sha>/                    # prior versions, kept for rollback
    └── bin/
        ├── rig-deploy                     # pull-and-apply
        └── rig-drift-check                # compare host state vs current/manifest.json

Components

1. Secrets pipeline (infra#225 scope)

  • SOPS file in dashecorp/infra encrypts: Claude Code OAuth credentials JSON, ibuild-e-bot GitHub App PEM, future per-pod tokens
  • age recipient = a per-Mac age public key (one per host so cross-host blast radius = 1)
  • rig-deploy runs sops decrypt → writes ~/.rig/current/env + adds Claude JSON to keychain via security add-generic-password -U
  • Renewal: a long-lived OAuth refresh token lives in SOPS; a renewal script exchanges it, re-encrypts, commits, and runs OpenTofu apply
  • Bootstrap: first-time Mac onboarding needs an operator to generate an age key, register the pubkey in dashecorp/infra, and run the first rig-deploy manually (documented one-time step)
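
The decrypt-and-write step in the bullets above could be sketched roughly as follows. This is a minimal sketch, not the rig-deploy implementation: the function names are hypothetical, the `sops decrypt` subcommand form assumes sops 3.9 or newer (older releases use `sops -d`), and the keychain service and account names would have to match whatever Claude Code actually reads.

```python
import os
import tempfile
from pathlib import Path


def sops_decrypt_cmd(sops_file: str) -> list:
    """argv for decrypting the fleet secrets file (assumes sops >= 3.9)."""
    return ["sops", "decrypt", sops_file]


def keychain_upsert_cmd(service: str, account: str, secret: str) -> list:
    """argv for `security add-generic-password`; -U updates an existing item.

    service/account are placeholders chosen by the caller (assumption).
    """
    return ["security", "add-generic-password", "-U",
            "-s", service, "-a", account, "-w", secret]


def write_env_file(path: Path, values: dict) -> None:
    """Write KEY=value lines with mode 0600, replacing any old file atomically."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=".env-")
    try:
        with os.fdopen(fd, "w") as fh:
            for key in sorted(values):
                fh.write(f"{key}={values[key]}\n")
        os.chmod(tmp, 0o600)   # mkstemp already uses 0600; made explicit here
        os.replace(tmp, path)  # atomic rename within the same directory
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

The command builders return argv lists so the caller can pass them to `subprocess.run` without a shell. Note that `-w <secret>` puts the secret on the process command line for the lifetime of the call, which is a trade-off worth recording in the security model.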

2. Code rev pinning (mirror of rig image-pin chore)

  • ~/.rig/current/repos/automate-e/ is checked out at an explicit SHA
  • New auto-pin chore workflow on automate-e cuts a PR in dashecorp/infra bumping the mac_fleet_automate_e_sha variable on each automate-e main push
  • Same admin-merge path as today's "chore: pin rig-conductor image to sha-XXXX" PRs
  • On PR merge → the next rig-deploy run picks it up
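
The variable bump that the auto-pin chore performs could look something like the sketch below. It assumes the SHA is stored as a simple `mac_fleet_automate_e_sha = "<sha>"` assignment in a tfvars-style file; the file layout and the `bump_sha` helper are illustrative assumptions, not the existing chore's code.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def bump_sha(tfvars_text: str, new_sha: str,
             variable: str = "mac_fleet_automate_e_sha") -> str:
    """Return tfvars_text with the single `variable = "..."` line set to new_sha."""
    if not FULL_SHA.fullmatch(new_sha):
        raise ValueError(f"not a full 40-char commit SHA: {new_sha!r}")
    pattern = re.compile(
        rf'^([ \t]*{re.escape(variable)}[ \t]*=[ \t]*")[^"]*(")',
        re.MULTILINE,
    )
    updated, count = pattern.subn(rf"\g<1>{new_sha}\g<2>", tfvars_text)
    if count != 1:
        raise ValueError(f"expected exactly one {variable} assignment, found {count}")
    return updated
```

Refusing short SHAs and ambiguous matches keeps the generated PR diff to exactly one line, which is what makes the admin-merge path safe to automate.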

3. Supervisor as code

  • launchd.plist.tpl rendered by OpenTofu with: image SHA, env file path, log paths, KeepAlive
  • rig-deploy writes new plist atomically → launchctl bootout gui/$(id -u)/ai.invotek.ibuild-e; launchctl bootstrap gui/$(id -u) ~/.rig/current/launchd.plist
  • Old plist kept in ~/.rig/releases/<prev>/ for rollback
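
One way to get the atomic write and the rollback directory described above is to stage each release under releases/<sha>/ and then swap the `current` symlink with a rename. The sketch below assumes that layout; `stage_release`, `point_current`, and `restart_cmds` are hypothetical helper names, and the launchctl argv simply mirrors the commands quoted in the bullet.

```python
import os
from pathlib import Path

LABEL = "ai.invotek.ibuild-e"  # label taken from the plist name in this doc


def stage_release(rig_root: Path, sha: str, plist_text: str) -> Path:
    """Write releases/<sha>/launchd.plist without touching `current`."""
    release = rig_root / "releases" / sha
    release.mkdir(parents=True, exist_ok=True)
    (release / "launchd.plist").write_text(plist_text)
    return release


def point_current(rig_root: Path, sha: str) -> None:
    """Atomically repoint `current` at releases/<sha> (also used for rollback)."""
    if not (rig_root / "releases" / sha).is_dir():
        raise FileNotFoundError(f"no staged release for {sha}")
    tmp_link = rig_root / ".current.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(Path("releases") / sha)  # relative to rig_root
    os.replace(tmp_link, rig_root / "current")   # rename swaps the symlink itself


def restart_cmds(uid: int, plist: Path, label: str = LABEL) -> list:
    """argv pairs for the bootout/bootstrap sequence."""
    return [
        ["launchctl", "bootout", f"gui/{uid}/{label}"],
        ["launchctl", "bootstrap", f"gui/{uid}", str(plist)],
    ]
```

Rolling back is then `point_current(rig_root, previous_sha)` followed by the same restart sequence.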

4. Deploy trigger

Two options; Pull is recommended:

  • Pull (recommended): a separate launchd LaunchAgent, ai.invotek.rig-deploy, fires every 5 min and runs rig-deploy. Self-healing, with no inbound network access needed. This mirrors Flux's reconcile-loop semantics.
  • Push: a GitHub Actions workflow SSHes from a runner to each Mac. Faster, but requires inbound SSH access. We already have that via Tailscale, but Pull avoids the ACL plumbing.
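
For the Pull option, the 5-minute LaunchAgent could be rendered along these lines. This is a sketch using Python's standard `plistlib`; the log file names, the `RunAtLoad` choice, and the `deploy_agent_plist` helper are assumptions, while `StartInterval` is the standard launchd key for a fixed-interval job.

```python
import plistlib


def deploy_agent_plist(rig_deploy_path: str, log_dir: str,
                       interval_s: int = 300) -> bytes:
    """Render a LaunchAgent plist that runs rig-deploy every interval_s seconds."""
    job = {
        "Label": "ai.invotek.rig-deploy",
        "ProgramArguments": [rig_deploy_path],
        "StartInterval": interval_s,   # 300 s = every 5 min
        "RunAtLoad": True,             # assumption: also reconcile at login
        "StandardOutPath": f"{log_dir}/rig-deploy.out.log",
        "StandardErrorPath": f"{log_dir}/rig-deploy.err.log",
    }
    return plistlib.dumps(job, fmt=plistlib.FMT_XML)
```

Because this agent is separate from ai.invotek.ibuild-e, a bad agent plist does not stop the reconcile loop that would replace it.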

5. Drift detection

  • rig-drift-check cron job (every 15 min) compares the actual launchd plist hash, env hash, and automate-e SHA against ~/.rig/current/manifest.json. Any diff → POST a HOST_DRIFT event to conductor /api/events.
  • Conductor surfaces host drift on /api/hosts (new endpoint), similar to /api/agents.
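
The comparison itself is small. The sketch below assumes the manifest exposes one expected value per checked item under keys such as `plist_hash`, `env_hash`, and `automate_e_sha`; those key names and the event payload shape are illustrative, since the conductor's event schema is not specified here.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_drift(expected: dict, actual: dict) -> list:
    """Names of expected fields whose actual value is missing or different."""
    return sorted(key for key, want in expected.items() if actual.get(key) != want)


def drift_event(host: str, drifted: list) -> Optional[dict]:
    """HOST_DRIFT payload (hypothetical shape), or None when nothing drifted."""
    if not drifted:
        return None
    return {"type": "HOST_DRIFT", "host": host, "fields": drifted}
```

Reporting the drifted field names rather than a single boolean lets the dashboard distinguish a hand-edited plist from a stale code checkout.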

6. Observability

  • Existing per-pod heartbeats keep working (no change to rig-agent-runtime code)
  • Add host-level HOST_HEARTBEAT event from rig-deploy post-apply with manifest hash → conductor dashboard shows fleet rollout progress
  • Disk, CPU, memory: shipped via the existing infra-health-mcp if it already covers the Mac (to be verified)
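
A host heartbeat carrying the manifest hash might be built as in the sketch below. The payload fields, the canonical-JSON hashing scheme, and the conductor base URL are all assumptions; only the HOST_HEARTBEAT event name and the /api/events path come from this plan. The function builds the request without sending it, so the caller decides on timeouts and retries.

```python
import hashlib
import json
import urllib.request


def manifest_hash(manifest: dict) -> str:
    """SHA-256 over canonical JSON so key order does not change the hash."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def heartbeat_request(conductor_url: str, host: str,
                      manifest: dict) -> urllib.request.Request:
    """Build (but do not send) the POST for a HOST_HEARTBEAT event."""
    body = {
        "type": "HOST_HEARTBEAT",
        "host": host,
        "manifest_hash": manifest_hash(manifest),
    }
    return urllib.request.Request(
        conductor_url.rstrip("/") + "/api/events",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

If every host reports the same manifest hash after a merge, the rollout is complete; mixed hashes show which hosts have not yet reconciled.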

7. Fleet scale-out (future N Macs)

  • New Mac onboarding = one OpenTofu PR adding a host block + one human bootstrap step (generate age key, run first rig-deploy)
  • Hostname-derived config: e.g. ibuild-e-2.local runs the same agent persona with a distinct AGENT_ID (ibuild-e-2) and competes on the same stream. The KEDA-equivalent scaling is handled by stream consumer groups, so no changes are needed
  • Xcode version: pinned per host via OpenTofu var; bootstrap script installs via xcodes CLI
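
Deriving AGENT_ID from the hostname could be as simple as the sketch below. The `ibuild-e` / `ibuild-e-N` naming pattern is inferred from the examples in this doc, and `agent_id_from_hostname` is a hypothetical helper.

```python
import re

AGENT_ID_PATTERN = re.compile(r"ibuild-e(-\d+)?")  # assumed fleet naming scheme


def agent_id_from_hostname(hostname: str) -> str:
    """Map e.g. 'ibuild-e-2.local' to the AGENT_ID 'ibuild-e-2'."""
    short = hostname.strip().lower().split(".", 1)[0]
    if not AGENT_ID_PATTERN.fullmatch(short):
        raise ValueError(f"hostname {hostname!r} does not follow the fleet naming scheme")
    return short
```

Failing loudly on an unexpected hostname avoids two hosts silently joining the consumer group under the same ID.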

Migration phases

| Phase | Deliverable | Depends on | Risk |
|-------|-------------|------------|------|
| 1 | infra#225: secrets pipeline only (manual code rev) | | Low: narrow scope, reversible |
| 2 | launchd plist rendered by OpenTofu, rig-deploy writes it | Phase 1 | Med: wrong plist could brick agent; rollback dir mitigates |
| 3 | Code rev auto-pin chore for automate-e | Phase 2 | Low: same pattern as rig image-pin |
| 4 | rig-drift-check + HOST_DRIFT events + /api/hosts | Phase 2 | Low: observability only |
| 5 | Second Mac onboarded as fleet test | Phases 1–4 | Med: Apple ID / Xcode procurement |

Open questions / decisions

  1. Apple ID per Mac vs shared? Currently 1 Mac uses 1 Apple ID for Xcode Cloud signing. Multiple Macs sharing 1 Apple ID may hit concurrent-session limits. Decision needed before phase 5.
  2. First-boot bootstrap automation. Do we want MDM (Apple Business Manager + Mosyle/Kandji) or stay with manual macOS install + bootstrap script? MDM is correct for >3 Macs but expensive for <5.
  3. Hardware cadence. When does the second Mac happen — driven by queue saturation, or proactive redundancy? Define a metric (e.g., "iBuild-E queue depth > 5 for >1h, sustained 3 days") that triggers procurement.
  4. Autologin / keychain unlock at boot. Today the Mac Mini auto-logs into the claude user and the keychain is unlocked. Document that this is intentional (security trade-off: agents can use the keychain without an operator) and record it in the security model.
  5. Xcode upgrade strategy. When a new Xcode lands, do all Macs get bumped together (fleet roll), or canary first? OpenTofu var per host enables either.
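
For question 3, the example procurement metric can be made concrete once an interpretation is chosen. The sketch below reads "queue depth > 5 for >1h, sustained 3 days" as: on each of three consecutive calendar days there is an unbroken run of samples above the threshold lasting more than one hour. That reading, the sample format, and the function names are all assumptions to be settled when the metric is defined.

```python
from datetime import datetime, timedelta


def saturated_days(samples, depth_threshold=5, min_run=timedelta(hours=1)):
    """Dates with an unbroken run of samples above threshold longer than min_run.

    samples: iterable of (datetime, queue_depth) pairs; runs reset at midnight.
    """
    days = set()
    run_start = None
    for ts, depth in sorted(samples):
        if depth > depth_threshold:
            if run_start is None or ts.date() != run_start.date():
                run_start = ts
            if ts - run_start > min_run:
                days.add(ts.date())
        else:
            run_start = None
    return days


def should_procure(samples, required_days=3, **kwargs) -> bool:
    """True when required_days consecutive dates are all saturated."""
    days = saturated_days(samples, **kwargs)
    return any(
        all(day + timedelta(days=i) in days for i in range(required_days))
        for day in days
    )
```

A gap in sampling is treated as part of the run here; if the conductor can miss samples, a maximum allowed gap would be a sensible extra parameter.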

What's NOT in scope

  • Linux runner fleet (different problem, Dev-E pods already cover it)
  • iOS device test farm (separate later epic)
  • Cross-platform unified persona — iBuild-E stays iOS-specialist, doesn't claim non-iOS work

Related

  • dashecorp/infra#225 — Phase 1: SOPS secrets pipeline (epic, in flight)
  • pattern_codex_oauth_bootstrap.md (claude memory) — analogous OAuth bootstrap procedure for rig pods
  • docs/infrastructure/agents.md — agent registry (iBuild-E entry)
  • docs/infrastructure/credentials.md — credentials registry, updated in phase 1