The Status Light
How Q announces maintenance, checks for active work, and proves it came back online · ~11 min read · Suggested by Bob · business, technical, operations
A teammate who can restart itself needs a way to tell the room when the lights are about to blink. The new Q maintenance protocol makes downtime visible, gives operators a low-friction restart path, and turns the first post-restart Slack message into a real smoke test.
The missing piece was not another health check
Q already had health checks. What was missing was social reliability: before a restart, people needed to know whether Q might interrupt work; during maintenance, they needed a visible status channel; after maintenance, the first Slack send needed to prove delivery was alive again. The new protocol adds #q-status as that public status light.
Maintenance now has one path
The operator starts with preflight. Maintenance checks for active or recently active work before Q goes dark. If work is in flight, the operator gets useful friction and can choose to wait or explicitly override. The user-facing shape is simple: Q says it is going offline, comes back with a visible Slack smoke test, then reports complete or degraded status.
The preflight is intentionally polite
The useful friction is at the beginning. Checking active sessions before restart prevents avoidable interruptions. The override path keeps urgent work possible. The offline/online/complete sequence keeps communication tight without requiring every operator to remember the whole ritual.
Status is routed like ordinary observer work
#q-status is not an admin room. It is routed to the observer lane and is meant for availability notices, not secrets or control-plane discussion. Operator coordination belongs elsewhere; Cairns content work stays in its content lane. Status messages are deliberately short so they do not become another dashboard people learn to ignore.
Scheduled windows make routine work boring
Regular maintenance windows now have a predictable rhythm. Tuesday and Thursday evenings, plus weekend mornings, are the preferred periods for lower-urgency restarts and upgrade work. Work can still happen outside them when needed; the point is to make routine maintenance boring and visible.
The happy path and the degraded path
The online message and the complete message say different things. The first proves Slack delivery works after restart; the second says the broader checks passed. If checks fail, Q should say degraded rather than making the team infer state from silence.
What future agents should remember
The durable rule is short. Before restart, check for active work; before going dark, announce; after restart, prove Slack; before declaring victory, run the checks. Future agents should use the runbook and helper rather than reconstructing the sequence from memory.
The failure mode was not dramatic. Q would be restarted, upgraded, or wedged for a while, and the people using it would find out by testing it in Slack. Sometimes the gateway was healthy but the model reply path was not. Sometimes a channel got the eye reaction that meant the gateway saw the message, but no visible model response followed. Sometimes the right answer was simply “wait, Q is being worked on.”
That is a normal operations problem wearing an AI hat. The fix is also normal: announce downtime, check for active work, restart in a known sequence, smoke-test the user-visible path, and say when the system is stable again.
The new piece is #q-status.
#q-status is the status light, not the machine room. It tells the team whether Q is going offline, coming back, degraded, or stable.
The missing piece was not another health check
Q already had plenty of technical checks: openclaw doctor, config validation, gateway health probes, cron inspection, Slack send tests, and security reviews. Those checks matter, but they mostly answer a question from inside the machine: “is the system healthy?”
The team needed a second answer: “should I expect Q to respond right now?”
That answer belongs in Slack, because Slack is where people discover Q’s availability in practice. A status channel gives maintenance a shared surface without putting operator detail in front of ordinary users, and without turning project channels into incident logs.
The rule is simple:
- use `#q-status` for availability announcements
- keep privileged operator coordination out of the status channel
- use `#cairns` for Cairns content work
- keep sensitive diagnostics and tokens out of status messages
Status messages are short on purpose. They should read like system broadcast notices from a teammate, not like a raw terminal dump.
Maintenance now has one path
The operator-facing implementation lives in the runbook and helper, but users mainly need the visible sequence:
- Preflight checks config and looks for active or recently active work.
- If active sessions exist, the operator gets pushback and can choose to wait or override.
- Q announces in `#q-status` that maintenance is beginning and replies may pause.
- Q sends an online message to `#q-status`; that message is also a Slack delivery smoke test.
- If follow-up checks pass, Q sends a maintenance-complete note. If not, it sends a degraded note with the next checkpoint.
The important design choice is that the Slack “back online” message is not merely communication. It proves the path people actually care about: Q can send a visible Slack message after restart.
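The ordering of that sequence can be sketched in a few lines of Python. This is a minimal illustration, not the actual helper script: the function name `run_maintenance`, its parameters, the returned state strings, and the message wording are all hypothetical, and the callables stand in for whatever the runbook really uses to check for work, post to Slack, and restart Q.

```python
from typing import Callable, List


def run_maintenance(
    preflight: Callable[[], List[str]],
    announce: Callable[[str], bool],
    restart: Callable[[], None],
    post_checks: Callable[[], List[str]],
    override: bool = False,
) -> str:
    """Run the visible maintenance sequence and return the final state.

    Every callable is a hypothetical stand-in:
      preflight()   -> descriptions of active or recently active work
      announce(msg) -> True if the #q-status message was delivered
      restart()     -> restarts Q
      post_checks() -> names of failed follow-up checks (empty if all passed)
    """
    active = preflight()
    if active and not override:
        # Pushback: the operator can wait, or rerun with override=True.
        return "blocked"

    announce("Q is going offline for maintenance; replies may pause.")
    restart()

    # The "online" message doubles as the Slack delivery smoke test.
    if not announce("Q is back online."):
        return "slack-delivery-failed"

    failed = post_checks()
    if failed:
        announce("Q is online but degraded: " + ", ".join(failed) + ". Next check underway.")
        return "degraded"

    announce("Maintenance complete. Q is stable.")
    return "complete"
```

Passing the steps in as callables keeps the sketch independent of any particular Slack client or restart command, which is why it can show the order of operations without claiming anything about the real tooling.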
The preflight is intentionally polite
Restarting an agent can interrupt real work. A local terminal session might be in the middle of an upgrade. A Slack-triggered observer task might be drafting a cairn. A cron might have claimed a maintenance pass. The preflight cannot understand every human context, but it can spot obvious signs of active work and slow the operator down.
That is the right kind of friction. It does not forbid restart. It makes the operator say, in effect, “yes, this is worth interrupting” or “no, we can wait.”
Urgent work still gets an override. Production operations need escape hatches. The protocol is there to catch accidental disruption, not to trap an operator in ceremony when the system is already down.
The preflight is a social safety check as much as a technical one: look for active work, warn clearly, then let an authorized operator make the call.
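One way to picture the preflight is as a filter over known sessions followed by an explicit override decision. The sketch below is an assumption-laden illustration: the `Session` type, the `preflight` function, and especially the 15-minute threshold for "recently active" are invented for the example and are not taken from the runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Session:
    name: str                # hypothetical label, e.g. "observer-draft"
    last_activity: datetime  # when the session last did something


# Assumption: "recently active" means activity within the last 15 minutes.
RECENT = timedelta(minutes=15)


def preflight(
    sessions: List[Session],
    now: datetime,
    override: bool = False,
    recent: timedelta = RECENT,
) -> Tuple[bool, List[str]]:
    """Return (allowed, warnings) for a proposed restart.

    Warnings are always reported, even when the operator overrides,
    so the interruption is a visible choice rather than an accident.
    """
    warnings = [
        f"{s.name} was active {int((now - s.last_activity).total_seconds() // 60)} min ago"
        for s in sessions
        if now - s.last_activity <= recent
    ]
    allowed = override or not warnings
    return allowed, warnings
```

The design point mirrored here is that the check never forbids a restart outright; it only refuses to proceed silently when work appears to be in flight.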
Status is routed like ordinary observer work
#q-status is routed to the observer lane. That is deliberate. Status is a public-enough function: people need to know whether Q is available, and the messages should not require host authority or secret-bearing context.
The boundary still matters. The status channel is not where operators paste raw config, tokens, stack traces with secrets, or privileged instructions. It is the broadcast layer.
That fits the lane model from The Only Locked Door: give the agent useful capability in the right room, while keeping the control plane behind the locked door. The observer can announce, answer ordinary questions, and route people toward the right channel. The main/admin lane remains where host-level operating decisions belong.
Scheduled windows make routine work boring
The preferred maintenance windows are now:
- Tuesday and Thursday, 7 PM to midnight MT
- Saturday and Sunday, 6 AM to noon MT
Those are not a promise that Q only changes during those windows. Security fixes, stuck gateways, and urgent config repairs can still happen whenever they need to. The windows are a default for lower-urgency maintenance: upgrades, restarts, cron reshaping, cleanup, and smoke-test-heavy changes.
At the start and end of those windows, Q posts concise notices in #q-status. The point is not to make every window an outage. The point is to set a cultural expectation: if Q is going to blink, these are the most likely times.
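The window schedule is simple enough to express as a lookup table. The following is a small sketch, assuming the caller already has the current time in Mountain Time; the names `PREFERRED_WINDOWS` and `in_preferred_window` are hypothetical, and midnight is treated as the exclusive end of the evening window.

```python
from datetime import datetime

# datetime.weekday() numbering: Monday=0 ... Sunday=6.
# Values are (start_hour, end_hour), with the end hour exclusive.
PREFERRED_WINDOWS = {
    1: (19, 24),  # Tuesday, 7 PM to midnight MT
    3: (19, 24),  # Thursday, 7 PM to midnight MT
    5: (6, 12),   # Saturday, 6 AM to noon MT
    6: (6, 12),   # Sunday, 6 AM to noon MT
}


def in_preferred_window(local_mt: datetime) -> bool:
    """Return True if a Mountain Time datetime falls in a preferred window.

    Assumption: local_mt is already expressed in Mountain Time; this
    sketch does no timezone conversion.
    """
    window = PREFERRED_WINDOWS.get(local_mt.weekday())
    if window is None:
        return False
    start, end = window
    return start <= local_mt.hour < end
```

A result of `False` does not mean maintenance is disallowed. As described above, urgent work can happen at any time; a check like this would only inform whether a lower-urgency restart should wait.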
The happy path and the degraded path
The happy path has three messages:
- Q is going offline for maintenance.
- Q is back online and Slack delivery works.
- Maintenance is complete and Q is stable.
The second and third can be close together. They are separate because “I can post to Slack” and “all post-restart checks passed” are not quite the same claim.
The degraded path is just as important. If the gateway returns but Slack delivery fails, the status channel may not receive the message, so the operator should use the next available channel to report the problem and keep investigating. If Slack works but cron or model replies are impaired, Q should say that plainly: back online, degraded, next check underway.
The team should never have to infer maintenance state from silence.
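The "next available channel" rule can be sketched as an ordered list of delivery attempts. This is an illustrative sketch only: `report_with_fallback`, the sender pairs, and the channel names in the usage note are hypothetical, and a real sender would wrap whatever Slack or out-of-band mechanism the operator actually has.

```python
from typing import Callable, Optional, Sequence, Tuple

# A sender is a (name, send) pair; send(message) returns True on delivery.
Sender = Tuple[str, Callable[[str], bool]]


def report_with_fallback(message: str, senders: Sequence[Sender]) -> Optional[str]:
    """Try each sender in order and return the name of the one that delivered.

    Returns None if every attempt fails, which is the signal that the
    operator has to report the problem by some other means.
    """
    for name, send in senders:
        try:
            if send(message):
                return name
        except Exception:
            # Treat a raised error as a failed delivery and try the next channel.
            continue
    return None
```

For example, an operator script might pass `#q-status` first and a direct message to the operator second, so that a degraded notice still lands somewhere visible when the status channel itself is unreachable.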
What future agents should remember
The operational rule is small enough to memorize:
before restart, check for active work; before going dark, announce; after restart, prove Slack; before declaring victory, run the checks.
The durable version lives in the Q maintenance runbook and the helper script on the Q host. Future agents should use those instead of reconstructing the protocol from memory.
That is the point of turning this into a cairn. Q’s operating knowledge should not live only in one active session, one Slack thread, or one operator’s head. When the system learns a better way to care for itself, that knowledge belongs in the shared library.
- Status is a product surface. For a Slack-native teammate, availability needs to be visible in Slack, not only in host logs.
- Restart starts with preflight. Check for active work and make the override explicit.
- The first online message is a smoke test. A post-restart `#q-status` send proves the user-visible delivery path is alive.
- Routine windows lower anxiety. The team knows when maintenance is likely without blocking urgent fixes outside the window.
- Which checks should be required before Q can declare maintenance complete rather than merely online?
- When an operator overrides active-work preflight, what should be logged for future review?
- Should any non-Q systems share `#q-status`, or should it stay narrowly scoped to Q availability?
- Operator's Guide to Q - The operator appendix for private lanes, debug channels, maintenance rituals, and GWS boundaries.
- The Only Locked Door - The lane and sandbox model that keeps status announcements separate from control-plane authority.
- Three Memories, One Q - The memory-boundary article behind the rule that durable operating knowledge should be promoted into shared docs.
Generated by Cairns · Agent-powered with Claude