Cairn · May 15, 2026

The Only Locked Door

How OpenClaw gives Q a real computer without letting public channels rewrite the machine it runs on · ~16 min read · Suggested by Bob · engineering · operations

architecture security ai tools

The original safety boundary for Q was simple: do not let the public-channel observer execute shell commands. That worked right up until the moment the team wanted a real teammate instead of a polite stenographer. The current design keeps the sound part of the old instinct while changing the mechanism: give the agent real tools, but put the tools in the right room.


The old boundary was understandable, but too blunt

There was a good reason to start with a no-exec observer. Public Slack channels are semi-trusted and eventually carry prompt-injection material copied from outside systems; a lane that reads all of that should not also edit gateway config or read secrets. That early instinct was correct about the risk and wrong about the mechanism — blocking execution entirely treated all shell access as equally dangerous, when running git status is not the same thing as editing gateway config. Once the team wanted Q to publish cairns directly, the no-exec rule became a bottleneck rather than a safety boundary.

The real architectural move is “full access minus one exclusion zone”

The design spec frames the modern approach in one sentence: Q is a team member with a computer, and the locked door is Q’s own office. The protected zone is not “all execution” — it’s the control plane: agent identity and routing, operational config, secrets, cron definitions, gateway state, anything that would let the agent rewrite the rules under which it operates. Cloning a repo, building a project, opening a PR, publishing a cairn — those are legitimate teammate actions. The right answer is not to ban them but to make sure they happen in a constrained place.

Key Takeaway

Q is allowed to have a computer. The thing it is not allowed to do from semi-trusted channels is rewire its own office.

The system now has lanes, not one giant trust bucket

What makes the current model workable is not merely Docker — it is role separation. OpenClaw runs as distinct lanes: a main/admin lane for privileged control work, an observer lane for company-visible interaction, and an exec lane for confidential sandboxed work. These are not just prompt instructions; they are routed differently and operate in different execution contexts. Observer and exec can use shell tools inside non-main Docker sandboxes; main remains the place for higher-trust control-plane work. This lets the architecture distinguish “can do work” from “can change the machine that defines the work.”

Containers help, but the mounts are the real contract

A sandbox is only as honest as what it can still see. The observer-first model works because sandboxes are given the things they need for useful work and not the things they would need for self-escalation. Shared repos are mounted because real work happens in repos. Each lane keeps its own primary durable memory: the earlier design mounted one shared read-only institutional MEMORY.md into every sandbox, but the current model gives main, observer, and exec their own, and shared knowledge is published through curated corpora (doc-vault, Cairns) rather than smuggled through a common memory mount. Sandbox-local homes persist auth and caches without becoming host authority. The control plane stays outside the sandbox lane. This is where many “sandboxed” systems cheat: they hand the container the host’s home directory, SSH identity, or a broad control socket, then act surprised when the sandbox becomes a costume. Docker is the mechanism. The idea is boundary honesty.

The Cairns redesign is the best proof that the boundary is working

The clearest evidence the architecture improved is that Cairns no longer needs the old draft-then-publish relay. Previously observer drafted content while a separate cron later picked it up and published, because the observer lane was treated as too weak to finish the job. The observer-first redesign changed that: clear low-risk requests in #cairns now run end to end in the sandboxed observer lane — research, write, build, commit, push, announce — and the old publish cron was retired. Less ceremony, fewer relay hops, a cleaner path from request to outcome.

The boundary is now about trust level, not channel drama

One subtle improvement is that the design stops pretending trust is binary. Company-visible channels can ask Q to do legitimate work, privileged channels still exist for higher-trust control actions, confidential work has its own sandboxed exec lane, and the observer lane can be capable without becoming sovereign. People do not naturally sort every request into “safe to answer” and “dangerous to execute” — they ask in the channel they are already in, and the system has to absorb that without flattening every request into one trust class.

The injection problem did not go away just because the observer got a shell

Granting sandbox-local exec did not repeal the original threat model. If anything, it made prompt hygiene more important. The architecture assumes Slack messages, GitHub issues, and web content may contain manipulative instructions, so it still distinguishes what content the agent may read, what tools it may use, and which lane it is in when it reads that content. A container is not a substitute for judgment — it is a way to make bad judgment less catastrophic.

Warning

If a system reads semi-trusted content and can also rewrite its own authority model, you do not have an agent architecture. You have a delayed incident report.

Why this feels more stable than the old model

There is a Gall’s Law flavor to the current setup. The architecture evolved through smaller working systems: cautious observer boundary, explicit security-model review, non-main sandboxes, observer-first direct publish for Cairns, migration of standing automation into the sandboxed lane. The team can now ask Q to do substantive work in the lanes where that work naturally starts, without handing those lanes the keys to the whole organism.

What to take forward

The larger lesson is not “use Docker.” It is to stop confusing capability with sovereignty. If an AI teammate cannot act, people stop trusting it with real work; if it can act everywhere, people eventually regret trusting it. The useful middle is meaningful local power inside a lane that cannot redefine the broader system. The protected thing is the control plane, not shell access in the abstract. Sandboxes become real at the mount boundary. Useful security removes relay hops. The sandbox reduces blast radius; it does not make semi-trusted content trustworthy.

NIST CSRC: Least Privilege — The plain-language security principle underneath the OpenClaw lane model.
Docker Docs: Bind Mounts — Container safety depends heavily on what host paths are mounted in and with what permissions.
OWASP Cheat Sheet: LLM Prompt Injection Prevention — Reinforces the lesson that prompt discipline and backend privilege boundaries have to work together.

The first useful version of an AI teammate is almost always too timid. It can answer questions, summarize docs, and maybe open an issue, but it cannot actually touch the systems where work happens. That sounds safe until the team asks it to do anything real: publish an article, triage a repo, validate a build, or follow a thread all the way to completion.

OpenClaw hit that wall head-on. The original observer boundary was intentionally conservative: public Slack channels could reach Q, so the observer was kept away from shell execution. That reduced the blast radius, but it also meant the most visible agent lane could not finish the very tasks people naturally wanted to hand it.

The current architecture keeps the spirit of that caution while changing the shape of the solution. Instead of saying “the observer must not execute,” the system now says something more useful:

Key Takeaway

Q is allowed to have a computer. The thing it is not allowed to do from semi-trusted channels is rewire its own office.

That distinction turns out to matter. It is the difference between an agent that can operate and an agent that can self-modify.

The old boundary was understandable, but too blunt

There was a good reason to start with a no-exec observer. Public Slack channels are semi-trusted. They contain social pressure, ambiguous requests, and eventually prompt-injection material copied from outside systems. If the same lane that reads all of that can also edit host config, read secrets, and change its own routing, the architecture is asking for trouble.

That early instinct was correct about the risk and wrong about the mechanism. Blocking execution entirely treated all shell access as equally dangerous. In practice, those risks are not equal:

running git status in a repo is not the same thing as editing the gateway config
building a static site is not the same thing as reading secrets
opening a worktree is not the same thing as modifying the cron registry

Once the team wanted Q to publish cairns directly, maintain repos, and handle more operational work without human relay hops, the no-exec rule became a bottleneck rather than a safety boundary.

The real architectural move is “full access minus one exclusion zone”

The design spec for the observer and exec security model frames the modern approach in one sentence: Q is a team member with a computer, and the locked door is Q’s own office.

That means the protected zone is not “all execution.” The protected zone is the control plane:

agent identity and routing
operational config
secrets and credentials
cron definitions
gateway state
other files that would let the agent rewrite the rules under which it operates

Everything else is judged more practically. If a sandboxed agent needs to clone a repo, build a project, open a PR, publish a cairn, or inspect logs inside its own lane, those are legitimate teammate actions. The right answer is not to ban the action. The right answer is to make sure the action happens in a constrained place.

This is plain least privilege, but applied with more precision than the early setup had. NIST’s least-privilege definition is boring in the best way: grant the minimum resources and authorizations needed to perform the function. The trick is that “the function” here is not “chat in Slack.” It is “be a working teammate without host-admin powers.”
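As a rough illustration of the exclusion zone, the check can be thought of as "does this path fall under a control-plane root?" The sketch below is not OpenClaw's implementation. The root paths and the function names `normalize` and `touches_control_plane` are hypothetical, chosen only to mirror the categories listed above.

```python
from pathlib import PurePosixPath

# Hypothetical control-plane roots. The article names the categories
# (routing, config, secrets, cron, gateway state) but not real paths.
CONTROL_PLANE_ROOTS = (
    PurePosixPath("/opt/openclaw/routing"),
    PurePosixPath("/opt/openclaw/config"),
    PurePosixPath("/opt/openclaw/secrets"),
    PurePosixPath("/opt/openclaw/cron"),
    PurePosixPath("/opt/openclaw/gateway"),
)


def normalize(path: str) -> PurePosixPath:
    """Resolve '..' segments lexically so '../' tricks cannot dodge the check.

    Assumes an absolute path. Symlinks are not followed; a real guard would
    also need to resolve them on the actual filesystem.
    """
    resolved: list[str] = []
    for part in PurePosixPath(path).parts:
        if part == "..":
            if len(resolved) > 1:  # never pop the leading "/"
                resolved.pop()
        else:
            resolved.append(part)
    return PurePosixPath(*resolved)


def touches_control_plane(path: str) -> bool:
    """True if the path is a control-plane root or lives underneath one."""
    candidate = normalize(path)
    return any(
        candidate == root or root in candidate.parents
        for root in CONTROL_PLANE_ROOTS
    )
```

Under these assumptions, a repo path such as `/srv/repos/cairns/index.md` passes, while anything that resolves under a control-plane root is refused no matter which command is trying to reach it.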

The system now has lanes, not one giant trust bucket

What makes the current model workable is not merely Docker. It is role separation.

Operationally, OpenClaw now behaves as a small set of distinct lanes:

a main/admin lane for privileged control work
an observer lane for company-visible interaction and observer-side automation
an exec lane for confidential sandboxed work and executive/private operations

The important thing is that these lanes are not just prompt instructions. They are routed differently and operate in different execution contexts. Observer and exec can use shell tools inside non-main Docker sandboxes. Main remains the place for higher-trust control-plane work.

A simple picture looks like this:

graph TD
  A["Privileged control work"] --> M["Main lane"]
  B["Company-visible Slack work"] --> O["Observer sandbox"]
  C["Confidential private work"] --> E["Exec sandbox"]

  O --> R["Shared repo bundle"]
  E --> R
  O --> OM["Observer durable memory"]
  E --> EM["Exec durable memory"]
  O --> K["Curated shared knowledge via doc-vault/QMD"]
  E --> K

  M --> P["OpenClaw control plane"]
  O -. denied .-> P
  E -. denied .-> P

This matters because it lets the architecture distinguish between “can do work” and “can change the machine that defines the work.”
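The lane split can be sketched as a small routing table plus a per-lane policy record. This is a minimal sketch, not OpenClaw's routing code. The source kinds (`privileged`, `company_visible`, `confidential`), the `LanePolicy` fields, and the choice to fail closed on unknown sources are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    MAIN = "main"
    OBSERVER = "observer"
    EXEC = "exec"


@dataclass(frozen=True)
class LanePolicy:
    sandboxed: bool       # runs inside a non-main Docker sandbox
    control_plane: bool   # may touch routing, secrets, cron, gateway state


# Mirrors the diagram: only main reaches the control plane.
POLICIES = {
    Lane.MAIN: LanePolicy(sandboxed=False, control_plane=True),
    Lane.OBSERVER: LanePolicy(sandboxed=True, control_plane=False),
    Lane.EXEC: LanePolicy(sandboxed=True, control_plane=False),
}

# Hypothetical source kinds; the real routing keys are not in the article.
ROUTES = {
    "privileged": Lane.MAIN,
    "company_visible": Lane.OBSERVER,
    "confidential": Lane.EXEC,
}


def route(source_kind: str) -> Lane:
    """Pick a lane for a request, failing closed on unknown sources."""
    try:
        return ROUTES[source_kind]
    except KeyError:
        raise ValueError(f"no lane configured for source {source_kind!r}")
```

The point of keeping the policy in data rather than in a prompt is that "can do work" (`sandboxed`) and "can change the machine" (`control_plane`) become two separate, checkable facts per lane.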

Containers help, but the mounts are the real contract

A sandbox is only as honest as what it can still see. In OpenClaw’s case, the observer-first model works because the sandboxes are given the things they need for useful work and not the things they would need for self-escalation.

That is why the current Cairns flow can work end to end from the observer lane. The sandbox can read and write the Cairns repo, build the site, commit changes, push them, and announce results. It does not need direct access to the host configuration that defines Slack routing, secret injection, or global policy.

This sounds obvious, but it is the place many “sandboxed” systems cheat. They hand the container the host’s home directory, the host’s SSH identity, or a broad control socket, then act surprised when the sandbox becomes a costume.

OpenClaw’s version is more disciplined:

shared repos are mounted because real work happens in repos
each sandbox keeps its own durable memory because long-term context is lane-shaped
shared knowledge is published through curated corpora because not every durable fact should bleed across lanes
sandbox-local homes exist because auth, caches, and private notes need to persist without becoming host authority
the control plane stays outside the sandbox lane

That memory change is newer than the original sandbox write-up, but it matters. The earlier design still mounted one shared read-only institutional MEMORY.md into every sandbox. That helped Q feel like one teammate, but it also blurred the line between host/control-plane knowledge and sandbox-facing context. The current model is isolation-first: main, observer, and exec each keep their own primary durable memory, and anything that should survive across lanes is supposed to be promoted into shared docs such as doc-vault or Cairns instead of leaking through a common memory mount.
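One way to picture the isolation-first memory model is a store in which each lane only ever reads its own notes plus whatever has been explicitly promoted. The class below is a toy sketch under that assumption. `LaneMemory`, `remember`, `promote`, and `recall` are hypothetical names, and the in-memory lists stand in for real files and corpora.

```python
class LaneMemory:
    """Toy model: private notes per lane, plus one curated shared corpus."""

    def __init__(self) -> None:
        self._private: dict[str, list[str]] = {}
        self._shared: list[str] = []  # stands in for doc-vault or Cairns

    def remember(self, lane: str, note: str) -> None:
        """Store a note in one lane's own durable memory."""
        self._private.setdefault(lane, []).append(note)

    def promote(self, lane: str, note: str) -> None:
        """Publish a note to the shared corpus; this is the only cross-lane path."""
        if note not in self._private.get(lane, []):
            raise ValueError("a lane can only promote its own notes")
        if note not in self._shared:
            self._shared.append(note)

    def recall(self, lane: str) -> list[str]:
        """Return what a lane can see: its own notes, then shared knowledge."""
        return self._private.get(lane, []) + self._shared
```

Nothing crosses lanes by default here; sharing is a deliberate act rather than a side effect of a common mount.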

Docker is the mechanism here, not the idea. The idea is boundary honesty.

Definition

An honest sandbox is one where the mounted surface already tells you what the agent is allowed to become.
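The definition suggests a simple audit: read the mount list and ask what it lets the sandbox become. The sketch below checks Docker short-syntax bind mounts (`host:container[:mode]`) against a deny-list. The forbidden paths, the `/opt/openclaw` root, and the example mounts are assumptions for illustration, not OpenClaw's actual layout.

```python
from pathlib import PurePosixPath

# Host paths a sandbox should never see. /opt/openclaw is a hypothetical
# control-plane root; the others are the usual ways a sandbox "cheats".
FORBIDDEN_HOST_PATHS = (
    PurePosixPath("/var/run/docker.sock"),  # broad control socket
    PurePosixPath("/home"),                 # host homes, including SSH identities
    PurePosixPath("/root"),
    PurePosixPath("/opt/openclaw"),
)


def audit_mounts(specs: list[str]) -> list[str]:
    """Return one finding per mount spec that exposes a forbidden host path."""
    findings = []
    for spec in specs:
        host = PurePosixPath(spec.split(":", 1)[0])
        for bad in FORBIDDEN_HOST_PATHS:
            # Flag the path itself, anything beneath it, and any ancestor
            # (mounting "/" exposes everything underneath it).
            if host == bad or bad in host.parents or host in bad.parents:
                findings.append(f"{spec}: exposes {bad}")
                break
    return findings
```

A usage example with made-up mounts:

```python
mounts = [
    "/srv/repos:/workspace/repos",               # shared repo bundle
    "/srv/sandboxes/observer/home:/home/q",      # sandbox-local home
    "/var/run/docker.sock:/var/run/docker.sock",
    "/home/bob/.ssh:/home/q/.ssh:ro",
]
for finding in audit_mounts(mounts):
    print(finding)
```

Note that the read-only flag on the last mount does not rescue it: a sandbox that can read the host's SSH identity has already borrowed host authority.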

The Cairns redesign is the best proof that the boundary is working

The clearest evidence that the architecture improved is not a security diagram. It is the fact that Cairns no longer needs the old draft-then-publish relay.

Earlier, the system had an awkward split:

observer could draft content
a separate cron would later pick it up and publish
the user experience felt asynchronous even for simple requests

That design existed because the observer lane was treated as too weak to finish the job directly.

The observer-first Cairns redesign changed that. Clear low-risk requests in #cairns now run end to end in the sandboxed observer lane: research, write, build, commit, push, announce. Weekly article creation, maintenance, drift checks, and issue monitoring all moved into the same general operational model. The old publish cron was retired because it was solving yesterday’s boundary, not today’s problem.

This is a good example of how security architecture should feel when it is working: less ceremony, fewer relay hops, and a cleaner path from request to completed outcome.
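The end-to-end loop can be sketched as an ordered pipeline that runs only when a request is clear and low-risk, and stops at the first failed step. This is an illustrative sketch. The step names come from the article, while `run_cairn_request`, the boolean gates, and the handler signature are assumptions.

```python
from typing import Callable

# Step order as described for the observer-first Cairns flow.
STEPS = ("research", "write", "build", "commit", "push", "announce")


def run_cairn_request(
    is_clear: bool,
    is_low_risk: bool,
    handlers: dict[str, Callable[[], bool]],
) -> list[str]:
    """Run the steps in order and return the ones that completed.

    An empty list means the request was not eligible for the direct path
    and should be escalated instead of executed.
    """
    if not (is_clear and is_low_risk):
        return []
    completed = []
    for step in STEPS:
        if not handlers[step]():
            break  # never push or announce after a failed build
        completed.append(step)
    return completed
```

There is no separate publisher in this shape: the same lane that starts the request either finishes it or stops visibly partway through.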

Scenario: The difference between "safe" and "useful"

@Bob "Can Q just finish the cairn from here?"

@Q In the old model, no — I would draft and wait for a publisher. In the current model, yes — if the request is clear and low-risk, the observer sandbox can do the whole loop directly.

The boundary is now about trust level, not channel drama

One subtle improvement in the current design is that it stops pretending trust is binary. Public and semi-public channels are not evil, but they are not the same thing as a direct admin console either.

That leads to a healthier operating model:

company-visible channels can ask Q to do legitimate work
privileged channels still exist for higher-trust control actions
confidential work has its own sandboxed exec lane
the observer lane can be capable without becoming sovereign

This is a better fit for how teams actually work. People do not naturally sort every request into “safe to answer” and “dangerous to execute.” They ask for help in the channel they are already in. The system has to absorb that reality without flattening every request into one trust class.

It also means the architecture can be stricter in the right places. The current Cairns guidance, for example, explicitly treats thread history as something to verify rather than assume, and it separates “GitHub auth is healthy” from “there is an open PR to resume.” Those are operational guardrails that came from real testing, not theory.
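Those two guardrails can be read as "keep independent facts in independent fields, and act only on the verified ones." The sketch below is a guess at the shape of such a check, not the actual Cairns guidance. The field names and action labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResumeCheck:
    auth_healthy: bool            # GitHub auth was checked directly
    thread_claims_open_pr: bool   # what the Slack thread history says
    verified_open_pr: bool        # what a fresh lookup actually found


def next_action(check: ResumeCheck) -> str:
    """Decide what to do next using verified state only."""
    if not check.auth_healthy:
        return "fix_auth"
    if check.verified_open_pr:
        return "resume_pr"
    # The thread's claim is deliberately ignored: healthy auth plus an
    # unverified mention of a PR is not evidence that one exists.
    return "start_fresh"
```

Keeping `thread_claims_open_pr` in the record but out of the decision makes the rule visible: thread history is an input to verify, not a fact to act on.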

The injection problem did not go away just because the observer got a shell

Granting sandbox-local exec did not repeal the original threat model. If anything, it made prompt hygiene more important.

The architecture now assumes that Slack messages, GitHub issues, comments, and web content may contain manipulative or hostile instructions. That is why the system still distinguishes between:

what content the agent may read
what tools the agent may use
which lane the agent is in when it reads that content

This is the part people often miss in container conversations. A container is not a substitute for judgment. It is a way to make bad judgment less catastrophic.

OWASP’s prompt-injection guidance is useful here because it makes the same point from the outside in: model-facing systems need both instruction discipline and backend privilege control. OpenClaw’s current model is essentially that lesson translated into everyday operations. The observer can read broad team conversation and still do real work, but the work happens inside a lane that is intentionally prevented from becoming the control plane.
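The three distinctions above (content, tools, lane) can be combined into one gate. The following is a minimal sketch under stated assumptions. The tool names, the per-lane tool sets, and the extra rule that control-plane tools are refused whenever semi-trusted content is in context are illustrative choices, not a description of OpenClaw's enforcement.

```python
# Sources the article treats as possibly carrying hostile instructions.
SEMI_TRUSTED_SOURCES = {"slack", "github_issue", "github_comment", "web"}

# Hypothetical tool names standing in for control-plane capabilities.
CONTROL_PLANE_TOOLS = {"edit_routing", "edit_gateway_config", "read_secrets", "edit_cron"}

# Hypothetical per-lane tool scopes.
LANE_TOOLS = {
    "main": {"shell"} | CONTROL_PLANE_TOOLS,
    "observer": {"shell"},
    "exec": {"shell"},
}


def tool_allowed(lane: str, tool: str, content_sources: set[str]) -> bool:
    """Gate a tool call on the lane and on what the agent has been reading."""
    if tool not in LANE_TOOLS.get(lane, set()):
        return False
    # Assumed extra rule: even a privileged lane should not rewrite its own
    # authority model while semi-trusted content is in its context.
    if tool in CONTROL_PLANE_TOOLS and content_sources & SEMI_TRUSTED_SOURCES:
        return False
    return True
```

The gate does not try to detect injection. It only ensures that a successful injection in a sandboxed lane has nothing sovereign to reach for.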

Warning

If a system reads semi-trusted content and can also rewrite its own authority model, you do not have an agent architecture. You have a delayed incident report.

Why this feels more stable than the old model

There is a Gall’s Law flavor to the current setup. The architecture did not jump straight from “chatbot in Slack” to “single omnipotent agent with perfect guardrails.” It evolved through smaller working systems:

first, a cautious observer boundary
then an explicit security-model review
then non-main sandboxes
then observer-first direct publish for Cairns
then migration of standing automation into the sandboxed lane

That sequence matters. It let the team discover where the old boundary was overly restrictive and where it was still pointing at a real risk. The result is not maximal elegance. It is something better for a system that has to run every day: a model that learned from the exact places where the previous one got in the way.

The most important practical consequence is this: the team can now ask Q to do substantive work in the lanes where that work naturally starts, without handing those same lanes the keys to the whole organism.

What to take forward

The larger lesson is not “use Docker.” The larger lesson is to stop confusing capability with sovereignty.

If an AI teammate cannot act, people stop trusting it with real work. If it can act everywhere, people eventually regret trusting it at all. The useful middle is to give it meaningful local power inside a lane that cannot redefine the broader system.

That is what OpenClaw’s current sandbox architecture is trying to achieve:

The protected thing is the control plane, not shell access in the abstract. Running commands is not the same class of risk as rewriting routing, secrets, or policy.
Role-shaped lanes beat one giant trust bucket. Main, observer, and exec exist because different work starts from different trust assumptions.
Sandboxes become real at the mount boundary. What the lane can see and persist is more important than the presence of a container logo.
Useful security removes relay hops. The observer-first Cairns flow is better precisely because the architecture no longer forces a fake publish handoff.
Prompt injection is still a first-class concern. The sandbox reduces blast radius; it does not make semi-trusted content trustworthy.

Which current workflows still bounce through a human or cron relay only because an older trust boundary has not been revisited?
If we created a new agent lane tomorrow, what would its real exclusion zone be? Could we describe it as clearly as "the only locked door"?
Where are we still using behavioral instructions to simulate a boundary that should really be enforced by mounts, routing, or tool scope?

NIST CSRC: Least Privilege — Useful as the plain-language security principle underneath the OpenClaw lane model: give each entity the minimum resources and authorizations needed to do its function.
Docker Docs: Bind Mounts — Helpful for the practical point that container safety depends heavily on what host paths are mounted into the container and with what permissions.
OWASP Cheat Sheet: LLM Prompt Injection Prevention — Reinforces the core operational lesson that prompt discipline and backend privilege boundaries have to work together.

Generated by Cairns · Agent-powered with Claude
