The failure mode was not dramatic. Q would be restarted, upgraded, or wedged for a while, and the people using it would find out by testing it in Slack. Sometimes the gateway was healthy but the model reply path was not. Sometimes a channel got the eye reaction that meant the gateway saw the message, but no visible model response followed. Sometimes the right answer was simply “wait, Q is being worked on.”

That is a normal operations problem wearing an AI hat. The fix is also normal: announce downtime, check for active work, restart in a known sequence, smoke-test the user-visible path, and say when the system is stable again.

The new piece is #q-status.

Key Takeaway

#q-status is the status light, not the machine room. It tells the team whether Q is going offline, coming back, degraded, or stable.

The missing piece was not another health check

Q already had plenty of technical checks: openclaw doctor, config validation, gateway health probes, cron inspection, Slack send tests, and security reviews. Those checks matter, but they mostly answer a question from inside the machine: “is the system healthy?”

The team needed a second answer: “should I expect Q to respond right now?”

That answer belongs in Slack, because Slack is where people discover Q’s availability in practice. A status channel gives maintenance a shared surface without putting operator detail in front of ordinary users, and without turning project channels into incident logs.

The rule is simple:

  • use #q-status for availability announcements
  • keep privileged operator coordination out of the status channel
  • use #cairns for Cairns content work
  • keep sensitive diagnostics and tokens out of status messages

Status messages are short on purpose. They should read like system broadcast notices from a teammate, not like a raw terminal dump.
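The "short, no secrets" rule could be checked mechanically before anything is posted. The sketch below is one way to do that, not the helper Q actually uses: the function name, the 300-character limit, the line-count threshold, and most of the patterns are assumptions made for illustration. The one grounded detail is that Slack tokens start with prefixes such as xoxb- or xoxp-.

```python
import re

# Hypothetical limit: the article only says status messages are "short".
MAX_STATUS_CHARS = 300

# Patterns that suggest a secret or a raw dump. Slack tokens begin with
# prefixes such as "xoxb-" or "xoxp-"; the other patterns are illustrative.
SUSPICIOUS_PATTERNS = [
    re.compile(r"xox[a-z]-[A-Za-z0-9-]+"),                            # Slack-style token
    re.compile(r"(?i)\b(api[_-]?key|secret|password|token)\s*[=:]"),  # key=value secret
    re.compile(r"Traceback \(most recent call last\)"),               # Python stack trace
]


def check_status_message(text: str) -> list[str]:
    """Return the reasons a message should not be posted to #q-status."""
    problems = []
    if len(text) > MAX_STATUS_CHARS:
        problems.append(f"too long ({len(text)} > {MAX_STATUS_CHARS} chars)")
    if text.count("\n") > 5:
        problems.append("looks like a terminal dump (too many lines)")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            problems.append(f"matches sensitive pattern: {pattern.pattern}")
    return problems
```

An empty list means the message is safe to send. Anything else is a prompt to trim the message or move the detail to an operator channel.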

Maintenance now has one path

The operator-facing implementation lives in the runbook and helper, but users mainly need the visible sequence:

  1. Preflight checks config and looks for active or recently active work.
  2. If active sessions exist, the operator gets a warning and can choose to wait or override.
  3. Q announces in #q-status that maintenance is beginning and replies may pause.
  4. After the restart, Q sends an online message to #q-status; that message is also a Slack delivery smoke test.
  5. If follow-up checks pass, Q sends a maintenance-complete note. If not, it sends a degraded note with the next checkpoint.

The important design choice is that the Slack “back online” message is not merely communication. It proves the path people actually care about: Q can send a visible Slack message after restart.
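The sequence can be sketched as a single function. This is an illustration under stated assumptions, not the helper script on the Q host: every callable is a hypothetical hook supplied by the caller, and the returned state names are invented for the example.

```python
from typing import Callable, Iterable


def run_maintenance(
    preflight: Callable[[], bool],
    restart: Callable[[], None],
    post_status: Callable[[str], bool],
    followup_checks: Iterable[Callable[[], bool]],
    override: bool = False,
) -> str:
    """Run the visible maintenance sequence and return the final state.

    Hypothetical hooks: preflight() returns True when no active work was
    found; post_status(text) returns True when the Slack send succeeded.
    """
    # Steps 1-2: preflight, with pushback unless the operator overrides.
    if not preflight() and not override:
        return "waiting"

    # Step 3: announce before going dark.
    post_status("Q is going offline for maintenance. Replies may pause.")

    restart()

    # Step 4: the online message doubles as the Slack delivery smoke test.
    if not post_status("Q is back online."):
        return "slack-delivery-failed"

    # Step 5: declare completion only if every follow-up check passes.
    if all(check() for check in followup_checks):
        post_status("Maintenance is complete and Q is stable.")
        return "stable"
    post_status("Q is back online but degraded. Next check underway.")
    return "degraded"
```

The smoke test needs no separate probe here. The return value of the step 4 send is the evidence that delivery works.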

The preflight is intentionally polite

Restarting an agent can interrupt real work. A local terminal session might be in the middle of an upgrade. A Slack-triggered observer task might be drafting a cairn. A cron might have claimed a maintenance pass. The preflight cannot understand every human context, but it can spot obvious signs of active work and slow the operator down.

That is the right kind of friction. It does not forbid restart. It makes the operator say, in effect, “yes, this is worth interrupting” or “no, we can wait.”

Urgent work still gets an override. Production operations need escape hatches. The protocol is there to catch accidental disruption, not to trap an operator in ceremony when the system is already down.

Tip

The preflight is a social safety check as much as a technical one: look for active work, warn clearly, then let an authorized operator make the call.
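A minimal sketch of that check follows. The Session record, the ten-minute "recently active" threshold, and the function names are all assumptions. The real preflight may gather its signals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Session:
    """A hypothetical record of work Q might be doing."""
    kind: str              # e.g. "terminal", "observer-task", "cron"
    last_active: datetime


def find_active_work(sessions, now, recent=timedelta(minutes=10)):
    """Return sessions active within the `recent` window (an assumed threshold)."""
    return [s for s in sessions if now - s.last_active <= recent]


def preflight(sessions, now, override_reason=None):
    """Decide whether a restart may proceed.

    Returns (proceed, message). Active work blocks the restart unless the
    operator supplies an explicit override reason.
    """
    active = find_active_work(sessions, now)
    if not active:
        return True, "No active work found."
    kinds = ", ".join(sorted({s.kind for s in active}))
    if override_reason:
        return True, f"Override accepted ({override_reason}); interrupting: {kinds}"
    return False, f"Active work detected: {kinds}. Wait, or re-run with an override reason."
```

Requiring a reason string, rather than a bare flag, keeps the override explicit and leaves something to review later.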

Status is routed like ordinary observer work

#q-status is routed to the observer lane. That is deliberate. Status is a public-enough function: people need to know whether Q is available, and the messages should not require host authority or secret-bearing context.

The boundary still matters. The status channel is not where operators paste raw config, tokens, stack traces with secrets, or privileged instructions. It is the broadcast layer.

That fits the lane model from The Only Locked Door: give the agent useful capability in the right room, while keeping the control plane behind the locked door. The observer can announce, answer ordinary questions, and route people toward the right channel. The main/admin lane remains where host-level operating decisions belong.

Scheduled windows make routine work boring

The preferred maintenance windows are now:

  • Tuesday and Thursday, 7 PM to midnight MT
  • Saturday and Sunday, 6 AM to noon MT

Those are not a promise that Q only changes during those windows. Security fixes, stuck gateways, and urgent config repairs can still happen whenever they need to. The windows are a default for lower-urgency maintenance: upgrades, restarts, cron reshaping, cleanup, and smoke-test-heavy changes.

At the start and end of those windows, Q posts concise notices in #q-status. The point is not to make every window an outage. The point is to set a cultural expectation: if Q is going to blink, these are the most likely times.
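A helper could use the windows to decide whether a routine restart is well timed. The sketch below encodes the windows listed above. It assumes the caller passes a datetime already expressed in Mountain Time, and the names are illustrative.

```python
from datetime import datetime, time

# Preferred windows keyed by weekday (Monday == 0). Each value is
# (start, end) in Mountain Time; end is exclusive, None means end of day.
_EVENING = (time(19, 0), None)         # 7 PM to midnight
_MORNING = (time(6, 0), time(12, 0))   # 6 AM to noon
PREFERRED_WINDOWS = {
    1: _EVENING,  # Tuesday
    3: _EVENING,  # Thursday
    5: _MORNING,  # Saturday
    6: _MORNING,  # Sunday
}


def in_preferred_window(local_mt: datetime) -> bool:
    """True if `local_mt` (already in Mountain Time) falls inside a window."""
    window = PREFERRED_WINDOWS.get(local_mt.weekday())
    if window is None:
        return False
    start, end = window
    now = local_mt.time()
    return now >= start and (end is None or now < end)
```

A False result is advice, not a block. Urgent fixes still go ahead outside the windows.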

The happy path and the degraded path

The happy path has three messages:

  1. Q is going offline for maintenance.
  2. Q is back online and Slack delivery works.
  3. Maintenance is complete and Q is stable.

The second and third can be close together. They are separate because “I can post to Slack” and “all post-restart checks passed” are not quite the same claim.

The degraded path is just as important. If the gateway returns but Slack delivery fails, the status channel may not receive the message, so the operator should use the next available channel to report the problem and keep investigating. If Slack works but cron or model replies are impaired, Q should say that plainly: back online, degraded, next check underway.
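The routing decision in that paragraph is small enough to write down. This is a sketch, not the production logic: the function name is invented, and the fallback destination is a placeholder because the article does not name one.

```python
STATUS_CHANNEL = "#q-status"
FALLBACK = "next available channel"  # placeholder; the article does not name one


def post_restart_report(slack_ok, failed_checks):
    """Return (destination, message) for the post-restart report.

    `failed_checks` lists the impaired components, e.g. ["cron"].
    """
    if not slack_ok:
        # #q-status may never receive this, so the operator reports elsewhere.
        return FALLBACK, "Gateway is up but Slack delivery is failing. Investigating."
    if failed_checks:
        impaired = ", ".join(failed_checks)
        return STATUS_CHANNEL, f"Q is back online but degraded ({impaired}). Next check underway."
    return STATUS_CHANNEL, "Maintenance is complete and Q is stable."
```

Every branch produces a message for someone to read. That is the property the degraded path needs.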

The team should never have to infer maintenance state from silence.

What future agents should remember

The operational rule is small enough to memorize:

before restart, check for active work; before going dark, announce; after restart, prove Slack; before declaring victory, run the checks.

The durable version lives in the Q maintenance runbook and the helper script on the Q host. Future agents should use those instead of reconstructing the protocol from memory.

That is the point of turning this into a cairn. Q’s operating knowledge should not live only in one active session, one Slack thread, or one operator’s head. When the system learns a better way to care for itself, that knowledge belongs in the shared library.

Key Takeaways

  1. Status is a product surface. For a Slack-native teammate, availability needs to be visible in Slack, not only in host logs.
  2. Restart starts with preflight. Check for active work and make the override explicit.
  3. The first online message is a smoke test. A post-restart #q-status send proves the user-visible delivery path is alive.
  4. Routine windows lower anxiety. The team knows when maintenance is likely without blocking urgent fixes outside the window.

Open Questions

  • Which checks should be required before Q can declare maintenance complete rather than merely online?
  • When an operator overrides active-work preflight, what should be logged for future review?
  • Should any non-Q systems share #q-status, or should it stay narrowly scoped to Q availability?

Related Reading

  1. Operator's Guide to Q - The operator appendix for private lanes, debug channels, maintenance rituals, and GWS boundaries.
  2. The Only Locked Door - The lane and sandbox model that keeps status announcements separate from control-plane authority.
  3. Three Memories, One Q - The memory-boundary article behind the rule that durable operating knowledge should be promoted into shared docs.