An agent harness is the program that gives a model access to your filesystem, your shell, your editor, your test runner, and your local sandbox. It is the difference between a chat window that can suggest a code change and a collaborator that can read the file, edit it, run the tests, see the failure, edit again, and report back when it has something working. The chat window is a tool. The harness is the seat the agent sits in.

You can ship production software with either. We picked the seat. The pick is not strict, and we have good reasons not to be strict; that looseness is what most of this cairn is about.

Why a harness, not a chat window

The first thing a chat window cannot do is read the rest of the codebase. You can paste a file in. You cannot paste in the file the function calls into, because you do not know yet which one matters. You can describe the project structure. You cannot describe everything an experienced reader would notice when actually opening the directory.

The second thing the chat window cannot do is run the build. It can suggest a change that “should pass tests.” It cannot prove it. The cost of that gap is a loop you, the human, are inside of: paste in code, run the test yourself, paste back the failure, paste in the suggested fix, run again. The agent moves at the speed of how fast you copy and paste.

A harness closes both gaps. The agent reads the project as it needs to. It runs the build itself. It edits its own draft based on the failure it just saw. The collaboration becomes “I want this done; here are the constraints” instead of “I will narrate the work into the chat for you to comment on.” That difference is most of what this trail is about.

Key Takeaway

The harness is the difference between an agent that suggests and an agent that executes. We hired you to direct execution, not to copy and paste it.

The author runs Claude Code with Anthropic’s Opus 4.7 model on the 1M-context tier. That is the recommended default, and it earns its keep on this codebase.

A few specifics matter:

  • Claude Code is the harness — install via Anthropic’s native installer. The native install is what Anthropic recommends; it is more featureful than the npm package and has not been subject to the supply-chain incidents the npm path has had. On a fresh machine today, there is no good reason to choose the npm route. The harness carries native session continuity (rewind, compact, resume), a robust skills system, an MCP server interface, and a plugin marketplace.
  • Opus 4.7 is the model. The trade-off relative to Sonnet is more capability per turn at higher latency and higher cost; for the kind of work this trail describes — design judgment, multi-file edits, careful review — that trade is worth it most days.
  • 1M context window is the long-context tier. Strike is large enough that the difference between 200K and 1M context is the difference between “I read most of the files I needed” and “I read all of them.” Pin to 1M for that reason.

This stack is strongly recommended but not required. The author is not trying to flatten contributor preferences — different people work in different ways, and every step away from this default is a choice worth making consciously rather than by accident.

One nuance worth naming. The desktop apps earn a side seat for some of the team. Claude Desktop (the standalone GUI app) is useful for chat-mode work that does not need filesystem or shell access — thinking, planning, research, document analysis, iterative writing. The author runs it alongside terminal Claude Code most days; the terminal version is the implementation seat, the desktop version is the thinking-and-reading seat. Codex’s desktop app fills a similar slot for contributors running Codex; Claude Cowork is another GUI surface worth a look. The desktop apps are a complement to the terminal harness, not a replacement.

Acceptable alternatives

A few alternatives are plausible without serious DIY costs. (“A few” undersells how many viable harnesses exist now; the list grows monthly. The relevant question is not “which harnesses exist” but “which harnesses give you the affordances this trail assumes”: file access, shell access, sandbox controls, persistent session state, plugin support.)

Codex CLI with GPT-5.5. Anthropic and OpenAI’s flagships are different models with different priors; both are perfectly capable of doing the work this trail describes. The Codex CLI is a mature harness in its own right — different conventions, different defaults, but the same shape of collaboration. Some of the team uses Codex as the daily driver. Some use Claude as the daily driver and Codex as a second-opinion lever. Codex as Second Opinion is the deep read on the second-opinion pattern.

Other harnesses that wrap Claude or Codex over OAuth. Cursor, Cline, Aider, and a handful more all run on the same underlying models. If your harness is one of these and you are happy with it, that is fine. The team’s skills and plugins are easier to share inside Claude Code’s plugin marketplace, but they are not exclusive to it.

Warning

Each step away from “Claude Code, Opus 4.7, 1M, OAuth subscription” is a step toward maintaining your own setup. The team will help you debug Claude Code or Codex; we are less likely to be useful when your harness is the part that broke. That is a reasonable price to pay for a setup you prefer; just price it in.

OAuth subscriptions, not API keys

This is the one structural decision the trail will state plainly. We use OAuth subscriptions — Claude Pro / Max for Claude, the equivalent tier for Codex. We do not use API keys for daily-driver work. This is not a preference; it is a budget reality at the level we lean on these agents.

The arithmetic is unfavorable for keys. A productive day with an agent at the kind of context window this trail assumes runs through tens of millions of tokens — not because any single turn is large, but because hundreds of turns accumulate. At API-key rates, that day is a meaningful line item. At a flat-rate subscription, the same day is bundled into a cost the team can plan around. The math holds for occasional users; it gets worse the more you use the tools.
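
To make the arithmetic concrete, a back-of-envelope sketch; every rate and volume below is an illustrative assumption, not a quoted price:

```shell
# Hypothetical numbers throughout -- the shape of the comparison is the point.
tokens_per_day=30000000   # "tens of millions" of tokens on a productive day
price_per_mtok=5          # assumed blended $ per million input tokens
work_days=20              # working days in a month

api_key_month=$(( tokens_per_day / 1000000 * price_per_mtok * work_days ))
echo "API-key month:   \$$api_key_month"
echo "Flat-rate month: a fixed subscription fee, regardless of volume"
```

At those assumed rates a heavy month on keys lands in the thousands; a subscription is a fixed line item the team can plan around.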

Definition

OAuth subscription here means: an account-billed plan (Claude Pro, Claude Max, ChatGPT Plus, ChatGPT Pro, equivalents) that the harness authenticates against using OAuth, with usage included in the flat fee subject to per-account caps. As distinct from API key billing, where each token is metered and charged.

If you are pulled into a setup that demands an API key — for an automation, for a CI integration, for a vendor sandbox — that is a separate decision and a separate budget. For the daily seat, OAuth.

Skills and plugins, in moderation

Claude Code’s skills system is one of the levers that turn the harness from “can read your code” into “knows how to act on your code.” A skill is a markdown document with frontmatter that the harness loads when its description matches the work you are doing. The skill teaches the agent the shape of a task — which tools to reach for, which conventions matter, what order to do things in.
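
The shape is easy to see in a sketch. Everything below (the name, description, and body) is invented for illustration, not a real team skill:

```markdown
---
name: example-release-notes
description: Use when drafting release notes from merged PRs
---

When asked for release notes:

1. List the PRs merged since the last tag.
2. Group them by area; lead each entry with the user-visible change.
3. Draft, then ask before publishing anywhere.
```

The frontmatter description is what the harness matches against the work in progress; the body is the task shape the agent loads once matched.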

The honest picture today: there is no shared team standard for which skills to install. The author leans on a load-bearing handful of dm-work and adjacent skills; Noam touches a different (and smaller) subset; other contributors run different sets again. That variance is part of why the Constructured-specific plugin set is in progress — once it lands, the team will have a shared core that everyone can adopt and contribute to. Until then, the skills below are the ones the author finds load-bearing in his own loop, offered as a starting point rather than a team prescription:

  • dm-work:orchestrator and dm-work:subagent — the delegation protocol. Activated at session start; sets the orchestrator-versus-implementer contract.
  • dm-lang:go-pro, dm-lang:typescript-pro — language-specific style and gotcha skills. Auto-activate when a Go or TS file is in scope.
  • dm-arch:solid-architecture, dm-arch:data-oriented-architecture — architectural review skills. Useful when you are reviewing or designing.
  • dm-work:debugging, debugger — investigation patterns. Useful when something is broken in a non-obvious way.
  • dm-work:browser-qa — runtime QA via Chrome DevTools MCP, when you have it installed.

Working in Parallel (Mostly) covers the broader plugin landscape, including the Constructured-specific plugin set in progress.

Tip

Resist the urge to install every plugin you find. Plugins compete for space in the agent’s attention; a stack of marginal plugins can dilute the strong ones. Pick the few that earn their keep, and add new ones deliberately.

Session orientation

The first thirty seconds of every Claude Code session set the tone for the rest of it. Strike’s AGENTS.md (and the CLAUDE.md symlink) is loaded automatically; it points the agent at bd prime and the project’s prime directive. The agent does the orientation work — it confirms branch, worktree, and the ready queue, then reports back. Your job is to watch what it reports.

Concretely, an agent in a fresh Strike session should land at something like: “On branch swe-trail-plan in the main checkout, bd ready shows os-pvw9 as next. Continuing with that?” You confirm or redirect. One minute, sometimes less.

The trick to making this passive is good ambient instrumentation. Two pieces are worth setting up once:

  • A shell prompt that surfaces the current branch and worktree at a glance. Starship is what the author uses — it works across shells and across machines. With starship configured, you do not have to ask whether you are on the right branch; your prompt tells you whenever you look at the terminal.
  • A custom Claude Code status line. Surfaces the same data inside the agent’s UI, plus context-budget and auto-mode state. Docs at code.claude.com/docs/en/statusline; the rotating section later in this cairn covers why the budget visibility part matters.

Together, these mean a glance at your terminal tells you the state of the world. The agent’s spoken confirmation then becomes verification rather than discovery.

The raw form — git branch --show-current, git worktree list, bd ready typed directly — still works and is what you reach for if you do not have the prompt and status line set up yet. A new contributor can absolutely run the trail without them; the ambient setup just makes orientation cheaper after a couple of sessions.
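
A self-contained sketch of that raw form, run in a throwaway repo so the commands have something to report (the branch name is borrowed from the example above; bd ready is project tooling and omitted here):

```shell
# Orientation, the manual way: confirm branch and worktree before any work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b swe-trail-plan   # stands in for your real feature branch
git branch --show-current           # which branch am I on?
git worktree list                   # which checkout is this?
```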

Warning

Never skip orientation, by either path. Working on the wrong branch wastes entire sessions silently — the diff lands somewhere unexpected, the merge conflict shows up later, and the agent has done several hours of work it cannot easily redo. Whether you confirm by reading your prompt or by asking the agent to report, the confirmation has to happen.

Rotating: handoffs over compaction

Sessions don’t last forever. Context windows fill, prompt caches expire, and even a 1M-context Opus 4.7 session has a point past which fresh context outperforms accumulated context. Knowing when to rotate and how to rotate well is the discipline this section is about.

Two mechanisms exist. The harness’s built-in compaction runs automatically as you approach the limit — older messages get summarized into a condensed form to make room for newer ones. The other path is a deliberate handoff — the agent (with your confirmation) writes a structured document that captures the state of the work, you /clear or open a fresh session, and the next session reads the document first. The team’s strong preference is the handoff path, for three reasons.

Transparency. A handoff is visible. You read the document; you see exactly what carries into the next session; you correct anything wrong before committing it. Compaction happens behind a curtain — the new context is whatever the harness summarized, and the summary is not always the part you needed.

Context rot. Compaction summarizes, and summaries lose information the agent later cannot recover. Past a certain density, accumulated context starts to behave like noise — the agent re-asks questions it answered earlier, drifts from constraints it had internalized, or spends turns reconstructing what the summary lost. The state of the art is meaningfully better than it was at 200k a year ago, but “better” is not “solved.” A clean handoff sidesteps the problem.

Cache predictability. Anthropic’s prompt cache has a 5-minute TTL on inactive content. Rotating via handoff at the natural break of a workstream produces predictable cache behavior. Mid-stream compaction thrashes the cache, and on a session 700k tokens deep, a cache miss is an expensive way to start the next turn.
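
The order of magnitude, sketched with placeholder prices (cached-read discounts vary by provider and tier; the ratio, not the dollar figures, is the point):

```shell
# Placeholder rates -- illustrative only, not quoted pricing.
awk 'BEGIN {
  tokens   = 700000        # context resident in the session
  price    = 5.00 / 1e6    # assumed $ per uncached input token
  discount = 0.10          # assumed cached reads billed at 10% of full price
  printf "next turn on a cache hit:  $%.2f\n", tokens * price * discount
  printf "next turn on a cache miss: $%.2f\n", tokens * price
}'
```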

Definition

Context rot: the gradual degradation of an agent’s reasoning quality as accumulated context grows past the model’s effective working set. Past a threshold (which varies by model and task), more context starts to hurt instead of help. State of the art is meaningfully better than at 200k, but every model still has a curve.

The 70% rule of thumb

Past 70% of the context window, performance starts to drop and cache misses start to bite. Treat 70% as a soft ceiling — when you cross it, plan a rotation. The practical heuristic: rotate at the natural break of the current workstream (a closed bead, a merged PR, a finished investigation). If you find yourself approaching 70% mid-workstream, that is itself a signal to ask whether the workstream wants to be subdivided into smaller beads.
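
In absolute terms, for the two tiers this cairn mentions:

```shell
# The 70% soft ceiling, in tokens, per context tier.
awk 'BEGIN {
  printf "200K tier: plan rotation near %dK tokens\n", 200 * 0.7
  printf "1M tier:   plan rotation near %dK tokens\n", 1000 * 0.7
}'
```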

Visualize your context budget

Most harnesses do not surface context-budget data prominently by default. Claude Code’s status line, configured well, can. A useful status line shows model and tier ([Opus 4.7 (1M context)]), working directory, git branch, a visual bar of context usage, the current percentage, session elapsed time, the rolling-5-hour usage budget, and the auto-mode indicator — all on a single line.

The status-line docs live at code.claude.com/docs/en/statusline. The author built a custom one by pointing the agent at those docs and describing the features wanted; the agent did the heavy lifting. Recommended for any Claude Code user — knowing your context-budget state passively, without having to ask, changes how often you remember to rotate. Similar customizations exist for Codex CLI and other harnesses.
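
As a sketch of where to start: Claude Code runs a configured shell command and pipes session JSON to it on stdin (the available fields are in the statusline docs linked above). The function below deliberately ignores that JSON and surfaces only what the shell itself can see; it is a starting point to extend, not the author's actual status line:

```shell
# Minimal status-line command: consume the harness's JSON, print branch | dir.
# Extending it with model, context %, and budget means parsing the stdin JSON
# per the statusline docs.
statusline() {
  cat > /dev/null                              # discard the session JSON for now
  branch=$(git branch --show-current 2>/dev/null || true)
  printf '%s | %s\n' "${branch:--}" "${PWD##*/}"
}
```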

Writing a good handoff

A handoff has three parts:

  1. What we were doing. The current workstream and where we are in it. Bead IDs in flight; PRs in flight; the spec being implemented; decisions already settled in this session that are not yet captured in beads, timbers, or the code itself.
  2. What is next. The next concrete action. “Run just check, push, wait for Q to review” is more useful than “finish the feature.”
  3. Open threads. Anything noticed during the work that is not the current workstream — a TODO, a related bead worth filing, a question to come back to. Bead it now if you can; otherwise capture it here so the next session can.

The team’s /dm-work:handoff slash command, where the dm-work plugin is installed, generates this in one invocation — capturing the structured form above and writing it to history/handoff-<date>.md (or the equivalent the project uses). You confirm the contents; you commit; you /clear. The next session reads the handoff first.

Without the plugin, ask the agent to draft one: “Write a handoff for this session covering the workstream we were on, the next concrete action, and any open threads worth filing.” Modern agents handle this well; read what was drafted and adjust the sections that miss something.
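
The three-part shape, as a file sketch. Every specific below (the bead ID, the PR number, the dates) is a placeholder invented for illustration:

```markdown
# Handoff: <date>

## What we were doing
Implementing the retry policy for the sync worker (bead os-xxxx); PR #NN open,
review pending. Settled this session: retries cap at 5, with jittered backoff.

## What is next
Run just check, push, wait for review.

## Open threads
File a bead for the flaky timeout test noticed in passing.
```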

Warning

A handoff is the moment to verify that beads, timbers, and the working tree are all clean. timbers pending reads 0; the bead being closed is closed; there are no uncommitted changes you forgot about. The handoff is not the place to defer cleanup; it is the place to confirm cleanup happened.

When compaction is the right call anyway

The working rule is handoff over compaction. Two cases make compaction the better choice:

  • A workstream that genuinely cannot be paused. A debugging session that has finally cornered a bug, where stopping to write a handoff would lose state the handoff would not capture cleanly. Compact, accept the cost, finish.
  • Context built mostly of throwaway exploration. Sessions sometimes accumulate 60% context on read-only investigation that was useful at the time but is not load-bearing going forward. The compaction summary is fine for that material — the next phase of work can continue with leaner context.

Outside those cases, the answer is rotate via handoff.

The 1M context era

A note on calibration. Back when the practical context was 200k (much of it consumed at initialization), handoffs were frequent: every couple of hours of focused work, sometimes more often. With 1M-context Opus 4.7, the cadence is dramatically less aggressive. A full workstream often fits in a single session, and the rotation point becomes the natural break between workstreams rather than a forced break inside one.

That actually makes the 70% rule and the status-line visibility more important, not less. At 200k the constraint pressed on you constantly; at 1M you can run a long session without noticing you are 800k in until performance degrades. The status line is the cheap, passive instrument that keeps you honest.

Cross-session recovery (when something does go wrong)

When a session ends without a clean handoff — a crash, a forced restart, a /clear you wish you hadn’t — beads is the recovery surface, not chat history. The agent runs bd prime to load project context, then bd show <id> for the work in flight, and reconstructs from the durable record. Claude Code’s native rewind and resume can also recover state from inside the same session if compaction collapsed something you needed. The broader rule, stated directly: anything that needs to survive a /clear lives in beads, in timbers, or in committed code — not in the chat thread.

Subagents, deliberately

Claude Code’s subagent system lets the orchestrator spawn a fresh agent with its own context window for a scoped task. The parent gets the result; the parent’s context never sees the raw research. Used well, subagents are how a senior engineer manages parallel work without losing their own thread.

The honest pattern, after a year of practice:

  • Use a subagent for read-only research that would otherwise dump 50KB of file content into the parent’s context. (“Find every place we use the polling-completion checker and tell me whether they all share the same idempotency assumption.”)
  • Use a subagent for parallel reviews — one reviewer reads the diff, another reads the design doc, a third checks the tests. The orchestrator integrates.
  • Use a subagent for scope-isolated implementation, when the work has a clean handoff and a clear definition of done.
  • Do not use a subagent for a one-line edit. The handoff cost is the work.
  • Do not use a subagent for the main thread of an open-ended exploration. You will spend more time briefing it than doing the work.

The judgment is partly taste. Spawn a subagent when the cost of bringing it up to speed is less than the cost of doing the work yourself in the parent context. That sentence is not a rule; it is a heuristic that gets sharper with practice.

What about your IDE?

Notably absent from everything in this trail is a recommendation for a particular IDE, an extension list, or a shared .vscode/ config. That is deliberate. We are non-prescriptive about IDE choice — bring your favorite, and use it as much or as little as you want.

What is worth saying out loud is what the trend looks like for some of us. The author has functionally moved past the IDE as the primary surface for code work. VSCode is open most of the time, but it is open mostly to read — markdown specs, notes, cairn drafts, occasionally code the author wants to skim before briefing the agent on it. The actual editing happens in the agent’s chat. Reviews happen in the agent’s chat. Even most code navigation now happens there. What is left for the IDE in this setup is what an editor was thirty years ago: a file tree with syntax highlighting and a visual search bar. What’s old is new.

Honest disclosure: a lot of contributors will look sideways at the claim that the IDE is on a path to becoming a markdown reader and not much else. That is a reasonable reaction. Direct coding and the IDE may never go away. But every major IDE vendor is currently scrambling to redesign around an agent-first user experience, which is a signal that they share the underlying intuition even when they would rather the IDE remain central. That is not a coincidence; it is the industry telling on itself about where the wind is blowing.

“IDE-centric” and “agent-centric” are not symmetric framings — the agent can use the IDE; the IDE has a hard time using the agent. That asymmetry is most of why the IDE-as-host model strains under serious agent-assisted work, and most of why the harness sits where it does in the seating chart this cairn argues for.

For a new contributor: do not feel pressure to abandon your IDE, and do not feel pressure to lean on it harder than you already do. Both extremes work. The trail’s mechanics are agnostic — the agent runs just check, the gates fire on commit, the timbers entry happens whether you typed the commit yourself or the agent did. The IDE is your seat for the parts of the work the agent does not do, and you will discover for yourself, over time, how much of the work that turns out to be.

Key Takeaway

Bring whatever IDE you already love. Avoid checking a shared .vscode/ or .idea/ config into the repo without team agreement; one person’s “obvious defaults” are another person’s friction. The day you find yourself spending most of your IDE time reading rather than typing, you will know what changed.

What this cairn does not cover

Several adjacent topics are deferred to later cairns where they belong.

Adjacent reading from outside this trail: The Quiet Teammate is a meditation on what extended collaboration with an agent in production actually feels like. It is more reflective than this cairn; useful as background. The Operating Q trail is the related-but-distinct read on running an agent (rather than working with one) in a production-adjacent role.

Summary

  1. The harness is the seat. It is the difference between an agent that suggests and an agent that executes. We hired you to direct execution.
  2. Recommended: Claude Code with Opus 4.7 on the 1M tier. Strongly recommended, not required. Conscious deviation is fine; accidental deviation costs you.
  3. Codex with GPT-5.5 is acceptable. Different priors, same shape of work. Some of us run both. Codex as Second Opinion is the second-opinion deep read.
  4. OAuth subscriptions, not API keys. Structural decision. The math is unfavorable for keys at our usage levels.
  5. The DIY tax increases with distance from precedent. Wrap-Claude-or-Codex harnesses are plausible; the team is less able to help when your harness is the part that broke.
  6. Orient at the start of every session. Branch, worktree, queue, agent confirmation. One minute. Saves hours.
  7. Rotate via handoff, not via compaction. 70% context is the soft ceiling; rotate at the natural break of a workstream. A custom status line makes the budget visible. /dm-work:handoff drafts the document where the plugin is installed; modern agents handle handoffs well unprompted otherwise. Compaction is the fallback.
  8. Subagents are a power tool. Spawn one when the cost of briefing it is less than the cost of doing the work in your parent context. Otherwise don't.
Open questions

  • If you came in with experience on a different harness, what do you find yourself missing — and is the gap worth bridging by switching, or by importing the pattern into Claude Code?
  • The OAuth-vs-API-key call is structural for us. For a contributor whose usage is much lighter than the average, would API-key billing actually win on cost, and what would change about how they would have to work?
  • The 70% rotation rule is empirical, not magical. What signal does your agent give you that says "context has gotten dense"? Is it different across models, and what would convince you to lower or raise that threshold for yourself?
  • What is your own rule for spawning a subagent versus continuing in the parent context? Is it the same heuristic this cairn names, or do you apply something different?
Further reading

  1. How We Build Here — The trail's opening cairn. The "shape of work shifted up the stack" framing is the philosophical reason a harness, rather than a chat window, is the seat we direct work from.
  2. The Workshop — The trail's tool map. Claude Code is the agent harness layer in that tour; this cairn is the deep read on it.
  3. The Quiet Teammate — A reflective companion piece on what working alongside an agent in production looks and feels like over months. Useful as background reading.
  4. Operating Q — The related-but-distinct trail on running an agent (Q) in a production-adjacent operating role. Different problem from "agent as collaborator," same underlying technology.
  5. Claude Code — Anthropic's official harness. Install instructions, docs, plugin marketplace. The canonical source for the recommended stack.
  6. Codex CLI — OpenAI's CLI harness. Source code and documentation; the alternative this cairn endorses for daily-driver work.
  7. Claude Code skills system — Documentation on how skills work in Claude Code. Required reading if you plan to author or evaluate skills the team should adopt.