Codex as Second Opinion
When a single model is bluffing or stuck, a second model is the cheapest unblock available · ~13 min read · Suggested by Bob, engineer
A single model can be wrong with high confidence. That is not a flaw of any specific model; it is a property of the technology. The model has priors, the priors fit some problems and miss others, and there is no internal signal that reliably distinguishes “I know this” from “I am confidently improvising.” When the gap matters — a debug session that has gone in circles, an architectural decision you would not want to redo, an unfamiliar library — the cheapest insurance is to ask a second model.
That is roughly all this cairn is about. Codex is the second model the team reaches for. The patterns below are the ones that have earned their keep over the last year of practice.
Why a second model
Cross-model coverage catches things one model cannot see. Different training corpora, different RLHF histories, different prompt-template defaults, different weights on similar trade-offs — all of that produces different blind spots. The miss is not always a bug in the first model’s reasoning. It is more often a frame-of-reference issue: Claude looks at a problem and sees one shape; Codex looks at the same problem and sees a slightly different shape, and the union is closer to the truth than either is alone.
The case where this matters most is when the first model is confident. A confidently wrong answer is the most expensive failure mode an agent has, because it short-circuits the human review that would otherwise catch it. A second model with different priors is the simplest tool we know of for reintroducing skepticism to that loop.
The point of a second model is not that it is “better” than the first. It is that it sees a different shape. Treat the disagreement between the two as the signal — when both agree confidently, you can usually trust the result; when they disagree, the disagreement is where the work is.
The lightweight pattern: Codex as solo reviewer
The default pattern on the team is the simplest one. Most pull requests get a review pass from Codex alone — no full council, no parallel subagents, just a single Codex invocation against the diff. The codex:review flow handles this, surfaces a different read than the author’s primary model, and finishes in under a minute.
What this gets you, in practice:
- Independent confirmation. Codex looks at the change with no knowledge of the conversation that produced it. If the change is right for reasons that hold up under fresh eyes, Codex will say so. If it is right only inside the conversation’s framing, Codex catches that.
- Style-and-correctness coverage. Codex tends to flag different patterns than Claude — typing tightness, error-handling gaps, naming choices, tests-that-do-not-test-what-the-name-says. Some flags are useful; some are noise; the split is roughly two-to-one in favor of useful, in our experience.
- Cheap by construction. A solo review against a diff is one round-trip. The token spend is bounded. The wall-clock cost is “while you grab coffee.”
For most PRs, that is the whole loop. You author with Claude, run codex:review before push, address what holds up, ignore what doesn’t, and move on. The team’s review-comment quality has gone up since we adopted the pattern, and the cost is low enough that we apply it to nearly everything that ships.
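The codex:review flow itself is team-internal. As a rough sketch of what a minimal equivalent might look like — the prompt wording is illustrative, and the non-interactive `codex exec` invocation is an assumption about your Codex CLI install, not the team's actual flow:

```python
import subprocess

def build_review_prompt(diff: str) -> str:
    """Frame the diff for a reviewer with no conversational context."""
    return (
        "Review this diff with fresh eyes. You have no knowledge of the "
        "conversation that produced it. Flag correctness issues, typing "
        "tightness, error-handling gaps, and tests that do not test what "
        "their name says. Say explicitly which parts hold up.\n\n" + diff
    )

def solo_review(base: str = "origin/main") -> str:
    """One round-trip: diff against base, hand it to Codex, return its read."""
    diff = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Assumes the Codex CLI supports a non-interactive `codex exec <prompt>`;
    # the exact flags your install accepts may differ.
    result = subprocess.run(
        ["codex", "exec", build_review_prompt(diff)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The key property is in `build_review_prompt`: the reviewer gets the diff and nothing else, so agreement from it is independent confirmation rather than an echo of the authoring conversation.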
When to escalate to a council
Sometimes the work needs more than two opinions. A council — multiple models running in parallel as cooperating reviewers, each anchored to a slightly different viewpoint, with a final convergence step — is the heavier pattern. It costs more in tokens, more in wall-clock, and more in your attention to integrate the feedback. Reach for it when:
- A debug session has clearly stalled. One model has been chasing the same hypothesis for several turns; the second model from codex:review did not break the loop. A council with three or four perspectives is the next escalation.
- You are in unfamiliar library or concept territory. The first model's confidence is suspect because it cannot have seen this material in proportion. Diverse priors help; a council is the structured way to get them.
- The decision is architectural. Two models agreeing on a small change is often enough. Two models agreeing on a system boundary is suspiciously easy to achieve through shared training data. A council with explicit different framings is more likely to surface the real disagreement.
- You explicitly want disagreement. A council can be configured to seed each member with a different stance (“optimize for clarity,” “optimize for performance,” “optimize for safety”), forcing the disagreement that solo review cannot reliably produce.
Outside those cases, two models is overkill, and three or more is overkill plus token spend. Default to solo Codex; escalate when the work asks for it.
There is also a “council of one” pattern worth knowing about — running the same model with different system prompts to get diverse stances. Cheaper than a real council, surprisingly effective on architectural questions, and often the right intermediate step between solo Codex and a full multi-model council.
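Seeding stances is mostly prompt plumbing. A hypothetical sketch of the forced-disagreement setup — the stance wording is invented for illustration, and the `codex exec` call is an assumed CLI shape; dm-team:council is the team's real implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Each member is anchored to one viewpoint, so disagreement is structural
# rather than hoped-for. Stance texts here are illustrative.
STANCES = {
    "clarity": "Optimize for clarity. Argue against anything clever.",
    "performance": "Optimize for performance. Argue against anything wasteful.",
    "safety": "Optimize for safety. Argue against anything that fails silently.",
}

def stance_prompt(brief: str, question: str) -> str:
    """Anchor one council member to a single viewpoint."""
    return f"Your stance: {brief}\n\nQuestion under review:\n{question}"

def run_member(name: str, brief: str, question: str) -> tuple[str, str]:
    # One independent, non-interactive Codex call per member (assumed CLI shape).
    out = subprocess.run(
        ["codex", "exec", stance_prompt(brief, question)],
        capture_output=True, text=True, check=True,
    )
    return name, out.stdout

def council(question: str) -> dict[str, str]:
    """Run the stances in parallel; a convergence pass then reads the dict."""
    with ThreadPoolExecutor(max_workers=len(STANCES)) as pool:
        futures = [
            pool.submit(run_member, name, brief, question)
            for name, brief in STANCES.items()
        ]
        return dict(f.result() for f in futures)
```

The convergence step — reading the per-stance answers and deciding where the real disagreement is — stays with the human; that attention cost is exactly why the council is the escalation, not the default.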
Agent teams: Claude Code’s experimental in-harness council
A council does not have to be a manual orchestration of separate harness sessions. Claude Code ships an experimental agent teams feature that gives you a structural council inside the harness itself: multiple Claude Code instances running in parallel, each with its own context window, communicating with each other directly, with a designated lead session that synthesizes findings. It is the closest thing to a council that lives natively inside the tool.
Agent teams are disabled by default. Enable them with the CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 environment variable (or the equivalent in settings.json). Requires Claude Code v2.1.32 or later.
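As a config fragment (the settings.json path and `env` key follow the Claude Code settings conventions; check your install's documentation if it differs):

```shell
# Enable agent teams for the current shell session (Claude Code v2.1.32+).
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1

# Or persist it in ~/.claude/settings.json instead:
#   { "env": { "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1" } }
```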
What you can do, in shape:
- Spawn teammates with distinct prompts (“one on UX, one on technical architecture, one playing devil’s advocate”).
- Have them research in parallel and message each other directly — unlike subagents, which only report back to the spawner. This is the architectural difference: subagents work for the parent; teammates work with each other.
- Watch them in the same terminal (Shift+Down to cycle) or in tmux / iTerm2 split panes (teammateMode: tmux in settings).
- Reference named subagent definitions as reusable teammate roles (e.g., a security-reviewer subagent definition can be spawned as a teammate).
- Pre-approve permissions on the lead so all teammates inherit them at spawn time.
Where agent teams genuinely shine:
- Adversarial debugging. Spawn N teammates, each on a different hypothesis, with explicit instructions to challenge each other. The theory that survives the debate is much more likely to be the actual root cause than what a sequential investigation tends to find. The docs cite this as one of the strongest use cases, and it matches the author’s experience.
- Multi-domain reviews. One reviewer on security, one on performance, one on tests, running simultaneously against the same diff. The lead synthesizes.
- Independent module work. Three teammates each owning a separate file or layer; no overlap, no overwrites.
Where they do not:
- Sequential or tightly coupled work — coordination overhead exceeds the benefit.
- Same-file edits — overwrite hazard, and a real one in practice.
- Cross-model perspective coverage. Every teammate is still Claude. Diverse stances come from prompting, not from a different model. For genuine cross-model coverage (Claude and Codex and maybe a third), you still want a separate Codex pass or an external orchestrator like dm-team:council that manages multiple harnesses. Agent teams are parallel; cross-model councils are complementary. Both are valid; they cover different failure modes.
Honest research-preview caveats. Session resumption (/resume, /rewind) does not restore in-process teammates — after a resume, the lead may try to message teammates that no longer exist. Task status can lag (teammates sometimes fail to mark a task complete, blocking dependents). One team per lead at a time; no nested teams; the lead session is fixed for the team’s lifetime. Token costs scale linearly with team size — each teammate is a full Claude instance with its own context. Split-pane mode requires tmux or iTerm2 (not the VS Code integrated terminal, Windows Terminal, or Ghostty). The feature is good and improving; it is not yet boring.
Practical pattern: keep agent teams a power tool, not a daily tool. The author reaches for them on stuck investigations where sequential debugging has clearly hit a wall, on cross-cutting reviews where a single reviewer would gravitate toward one type of issue, and on the rare new feature that splits cleanly into independent pieces. Three to five teammates is the comfortable range; more than five tends to scatter rather than scale.
codex:rescue
When Claude is stuck, the right move is sometimes to hand the problem to Codex with a fresh framing. The codex:rescue flow does exactly that: package the current state of the work, the constraint set, and the failed attempts so far, and ask Codex to take it from a clean slate. Sometimes Claude was missing the right prompt; sometimes Claude was missing the right model. Either way, the rescue pattern is fast and tends to work.
The honest signals that say “this is a rescue case”:
- The same investigation has been opened, abandoned, and reopened more than twice in one session.
- The model has produced contradictory recommendations within ten turns.
- The model is now editing something other than what you asked it to edit.
- You are the one suggesting the fixes and the agent is implementing them, which is the inversion of the pattern that works.
Reach for codex:rescue early when those signals appear. The penalty for an unnecessary rescue is small (you got a fresh take you did not need); the penalty for not rescuing is the rest of the session.
A rescue is not a vote of no confidence in the first model. It is a recognition that this prompt thread has hit a local minimum. Sometimes the same model with the same framing works fine after a context reset. Codex’s value in the rescue is the combination of fresh context and different priors — both halves matter.
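Packaging the state is the useful half of the flow: the fresh model should inherit the constraints and the failure history without inheriting the framing. A hypothetical sketch — the real codex:rescue skill is team-internal, and both the prompt structure and the `codex exec` call are assumptions:

```python
import subprocess

def build_rescue_prompt(goal, constraints, failed_attempts):
    """Hand the problem to a fresh model with no loyalty to prior attempts."""
    attempts = "\n".join(f"- {a}" for a in failed_attempts)
    lines = [
        "You are taking over a stalled task from a clean slate.",
        f"Goal: {goal}",
        "Constraints: " + "; ".join(constraints),
        "Attempts that already failed (do not repeat them):",
        attempts,
        "Propose a different approach before writing any code.",
    ]
    return "\n".join(lines)

def rescue(goal, constraints, failed_attempts):
    # One non-interactive Codex call (assumed `codex exec` CLI shape).
    return subprocess.run(
        ["codex", "exec", build_rescue_prompt(goal, constraints, failed_attempts)],
        capture_output=True, text=True, check=True,
    ).stdout
```

Listing the failed attempts explicitly is what buys the "fresh context" half of the value: the new model gets the evidence without the dead-end reasoning that produced it.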
More art than science
The judgment between “skip the second opinion,” “use Codex as solo reviewer,” “spin up a council,” and “rescue with a different model” is partly experience and partly taste. There is no rule that gets it right every time, and the team has explicitly stopped trying to write one.
A working heuristic, after a year of practice:
- Default to solo Codex review on anything you would push to a colleague. Cheap, fast, catches a real fraction of issues.
- Skip the second opinion on truly mechanical changes — formatting, dependency bumps, generated-code commits. The token spend is not earned.
- Escalate to a council when the work is architectural, when you are in unfamiliar territory, or when the first model is clearly stuck.
- Use codex:rescue when a session has hit a local minimum and you can feel the loop forming.
Try the patterns. Keep what works on your work. The team is not optimizing for everyone using the same loop; we are optimizing for everyone having a loop they trust.
Summary
- A single model can be wrong with high confidence. A second model is the cheapest tool for reintroducing skepticism to the loop.
- Default to solo Codex review. The codex:review flow on most PRs catches a real fraction of issues at low cost. This is the pattern the team uses most.
- Escalate to a council when the work is architectural, you are in unfamiliar territory, the session has stalled, or you explicitly want forced disagreement.
- Agent teams are Claude Code's experimental in-harness council. Multiple Claude instances, parallel context windows, direct teammate-to-teammate messaging. Best for adversarial debugging and multi-domain reviews. Worth knowing about; not a daily tool. Cross-model coverage still wants Codex (or dm-team:council) in the mix.
- Reach for codex:rescue early when a session has hit a local minimum. The penalty for an unnecessary rescue is small; the penalty for not rescuing is the rest of the session.
- The choice is more art than science. Try the patterns; keep what works on your work; do not chase a universal rule.
Open questions
- The "default to solo Codex review" pattern depends on the second model being meaningfully different from the first. If both models are trained on overlapping data with similar RLHF, what's the practical floor on cross-model coverage, and how would we tell when we have hit it?
- Councils are expensive in attention as well as tokens — the human has to integrate the feedback. What council size has the best signal-to-attention ratio in your experience, and how does that change when the council members are seeded with different stances?
- Agent teams give you parallel Claude instances; a cross-model council gives you different priors. If you had to pick one for the next genuinely stuck investigation, which would you reach for, and what would change your mind?
- If you had to invent a fourth pattern beyond "skip / solo / council / rescue," what gap in the existing three would you be filling? Where do they leave you wanting?
Related reading
- How We Build Here — The trail's opening cairn. The "deterministic constraints over taste" framing covers why a second model belongs in the loop alongside the gates: both are mechanisms for catching what one human-plus-model misses.
- The Workshop — The trail's tool map. Codex sits in the optional/recommended layer; this cairn is the deep read on the second-opinion pattern that justifies its place there.
- Claude Code as Daily Driver — The harness cairn. Codex framed there as an acceptable alternative; this cairn covers the complementary pattern of running both.
- Your Box and Your Trust Model — The trust-posture cairn. Worth re-reading before extending your auto-allow list to cover Codex CLI invocations.
- Codex CLI — OpenAI's harness. Source code, install instructions, configuration reference. Required reading if you have not used Codex before.
- LLM Council (Karpathy) — Background reading on the council pattern: multiple models cooperating as reviewers, with a convergence step. The team's dm-team:council skill is influenced by this work.
- Claude Code agent teams — Official documentation for the experimental agent teams feature. Enable flag, display modes, task-list semantics, permissions inheritance, and the full list of research-preview limitations to consult before reaching for it on live work.
- Building a C compiler with a team of parallel Claudes — Anthropic engineering write-up on a real agent-teams use case. Useful as a worked example of where the parallel-Claude pattern earned its keep.
Generated by Cairns · Agent-powered with Claude