The Last False Complete
Why field workflows need a soft-complete state before they earn the right to be done · ~15 min read · Suggested by technicaloperations
In field operations, "all current tasks are completed" is not the same thing as "the work is truly over." Osprey Strike's ECO polling model makes that distinction explicit with a completion grace period, late-clone detection, and controlled reactivation, which turns a brittle boolean into an operationally honest workflow.
“All tasks are completed” is an observation, not a verdict
The polling logic in Strike is refreshingly plain about what it knows: it asks whether all currently visible tasks for an ECO’s subsector are completed. That is not the same question as whether the incident is over forever. The completion checker marks COMPLETED when all tasks complete, and reverts to IN_PROGRESS when new tasks appear or incomplete tasks reappear. The domain invariants explicitly allow COMPLETED → IN_PROGRESS — the giveaway that reactivation is a business rule, not an accidental loophole.
A field-work system should treat completion as a claim under observation, not a permanent fact discovered once.
Soft-complete is the honest state when the world may still move
The grace period exists because the workflow has inertia. Field work doesn’t materialize as one neat atomic transaction — investigation, follow-on, reassignment, and verification arrive in bursts. Strike keeps polling COMPLETED ECOs for a configurable window (24 hours by default), giving a three-part progression instead of a brittle binary: Active (OPEN/IN_PROGRESS, polling), Soft-complete (COMPLETED within grace, still polling), Hard-complete (grace expired, polling stops). That middle state is where the system admits uncertainty without becoming useless.
Late clones are not edge cases, they are workflow reality
Late-clone detection is the most practical detail. The polling service tracks task counts and distinguishes “the same finished set” from “a new task has joined the subsector” — because field techs routinely finish an investigation before creating the follow-on repair. The investigation establishes what is wrong; only then does the next piece of work become concrete. A naive “first moment everything is completed equals final completion” rule would make ordinary workflow look like a bug.
Reactivation should be legal, visible, and boring
Bad systems reactivate awkwardly — smuggling reopened work through hidden flags, comments, or ad hoc exceptions because the original state machine assumed completion was irreversible. Strike makes reactivation a valid transition (COMPLETED → IN_PROGRESS for grace-period activity) and emits explicit reasons (all tasks in subsector completed, new task detected during grace period, task reopened during grace period) so debugging is plain SQL, not tribal memory. A reactivation path is not a concession to messiness; it is how a disciplined system handles a world that changes after the first apparent ending.
The real design pattern is not saga magic, it is durable observation
The polling docs are candid that completion checking is embedded in the polling service and directly updates the read model — a pragmatic compromise rather than a grand workflow-engine abstraction. The valuable pattern is not “use a fancy saga framework”; it is: keep durable state, keep timers, tolerate retries and duplicates, observe long enough to catch late truth, and make status transitions explicit. The pager process manager reaches the same conclusion with database-backed state and deadlines. Different mechanism, same instinct: durable observation beats brittle optimism.
Idempotency is not optional when completion can be revisited
As soon as a system allows late events, reopened work, and repeated polling, duplicate handling becomes table stakes. A polling cycle may see the same completed task set repeatedly; a webhook may deliver an event more than once. The response cannot be panic — it has to be idempotency. Strike already leans this way: the pager process manager handles duplicates explicitly, the completion checker treats same-state updates as no-ops. The grace-period pattern is not just about timers — it’s about accepting that the same workflow may be observed many times before it is truly quiet.
The operator experience depends on naming the states correctly
Internal truth and user-facing meaning are not always identical. The UX should avoid implying irreversible finality during a grace-period window — not necessarily by exposing a literal SOFT_COMPLETED label, but through a “monitoring for follow-on work” indicator, timeline entries explaining reactivation reasons, and language that distinguishes “all current tasks completed” from “case closed permanently.”
The bigger lesson is about truth in long-running systems
The completion grace period is a small feature with a large worldview behind it: status is a model of reality, not reality itself; terminal states should be earned, not assumed; long-running workflows need timers as much as transitions. In operational domains the cost of premature certainty is not a weird record — it is the wrong team believing an outage is resolved, the wrong dashboard going quiet, the wrong escalation path standing down too soon. The best workflow systems aren’t the ones that declare victory fastest; they’re the ones that wait just long enough to avoid being fooled.
Distributed systems people love clean terminal states. COMPLETED feels good. It closes dashboards, quiets alerts, and lets everyone move on.
Field operations are less polite.
A technician finishes the investigation task, then clones a repair task ten minutes later. A supervisor reopens a task because the original closure was premature. A polling loop catches the system at a moment when every visible task is complete while the reality on the ground is still changing. If the platform treats the first moment of apparent completion as final truth, it will lie to operators at exactly the moment they most need accurate status.
That is why Osprey Strike’s ECO workflow does something subtle and important. It allows an ECO to become COMPLETED, but only as a soft completion during a grace period. During that window, polling continues. New tasks can reactivate the ECO. Reopened tasks can reactivate the ECO. Only after the grace period expires without new activity does the system earn a hard stop.
This is one of those design choices that looks like implementation detail until you imagine the alternative. Then it becomes obvious.
“All tasks are completed” is an observation, not a verdict
The polling logic in Strike is refreshingly plain about what it knows. It checks the tasks associated with an ECO’s subsector, determines whether any tasks exist, and asks a narrow question: are they all currently completed?
That is useful, but it is not the same as asking whether the incident is over forever.
The internal completion checker makes the distinction explicit:
- if all tasks are completed and the ECO is not already completed, mark it COMPLETED
- if the ECO is completed and new tasks appear, move it back to IN_PROGRESS
- if the ECO is completed and incomplete tasks reappear, move it back to IN_PROGRESS
That logic is small. The idea behind it is bigger.
The domain invariants allow COMPLETED -> IN_PROGRESS specifically to support reactivation during the grace period. That is a strong sign that the behavior is a business rule, not an accidental side effect.
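Those rules are small enough to sketch as a pure decision function. The names below (`Status`, `nextStatus`) are illustrative, not the actual completion.go API; the point is that one snapshot of task counts plus the current status fully determines the next status:

```go
package main

import "fmt"

// Status values mirror the ECO lifecycle described above.
type Status string

const (
	Open       Status = "OPEN"
	InProgress Status = "IN_PROGRESS"
	Completed  Status = "COMPLETED"
)

// nextStatus applies the completion checker's rules to one polling
// snapshot: all-complete promotes to COMPLETED, while new or reopened
// tasks pull a COMPLETED ECO back to IN_PROGRESS.
// (Illustrative sketch, not the real completion.go implementation.)
func nextStatus(current Status, total, completed int) Status {
	allDone := total > 0 && completed == total
	switch {
	case allDone && current != Completed:
		return Completed
	case !allDone && current == Completed:
		return InProgress // reactivation: new or reopened task in the subsector
	default:
		return current // no-op: same observation, same state
	}
}

func main() {
	fmt.Println(nextStatus(InProgress, 3, 3)) // all tasks done -> COMPLETED
	fmt.Println(nextStatus(Completed, 4, 3))  // late clone appeared -> IN_PROGRESS
	fmt.Println(nextStatus(Completed, 3, 3))  // unchanged -> COMPLETED
}
```

Note that the COMPLETED → IN_PROGRESS branch is not an error path; it is the invariant-sanctioned reactivation rule.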
A field-work system should treat completion as a claim under observation, not a permanent fact discovered once.
If you collapse those ideas, you get false confidence. The UI says complete. The NOC relaxes. Then a follow-on task appears and everyone has to mentally unwind what the software prematurely declared settled.
Soft-complete is the honest state when the world may still move
The grace period exists because the workflow has inertia. Work in the field does not materialize as one neat atomic transaction. Investigation, follow-on work, reassignment, and verification often arrive in bursts.
Strike’s Render polling model keeps polling COMPLETED ECOs for a configurable period, twenty-four hours by default in the documented workflow. During that time, the ECO is done enough to communicate progress, but not done enough to abandon observation.
That gives the system a three-part progression instead of a brittle binary:
- Active work: OPEN or IN_PROGRESS, poll continuously.
- Soft-complete: COMPLETED, but still within the grace period, keep polling.
- Hard-complete: grace period expired with no new activity, stop polling.
A simple timeline looks like this:
```mermaid
flowchart LR
    A[Tasks active] --> B[All visible tasks completed]
    B --> C[Mark ECO COMPLETED]
    C --> D[Grace period continues polling]
    D -->|New task appears| E[Reactivate to IN_PROGRESS]
    D -->|Task reopened| E
    D -->|No activity until timer expires| F[Hard complete, stop polling]
```
That middle state is doing the real work. It is where the system admits uncertainty without becoming useless.
Late clones are not edge cases, they are workflow reality
One of the most practical details in the internal docs is late-clone detection. The polling service tracks task counts and can distinguish between “the same finished set of tasks” and “a new task has now joined the subsector.”
That matters because a field technician may finish an investigation task before creating the follow-on repair task. Operationally, that is not bizarre. It is often exactly how the work happens. The investigation establishes what is wrong. Only then does the next piece of work become concrete enough to create.
If the platform used a naive rule like “first moment everything is completed equals final completion,” that ordinary workflow would look like a bug. The software would oscillate between certainty and embarrassment.
Instead, Strike makes room for the real sequence:
```mermaid
sequenceDiagram
    participant N as NOC
    participant S as Strike
    participant R as Render
    participant F as Field tech
    N->>S: Create ECO
    S->>R: Create investigation task
    F->>R: Complete investigation
    S->>S: Detect all current tasks complete
    S->>S: Mark COMPLETED, keep polling in grace period
    F->>R: Clone follow-on repair task
    S->>S: Detect new task during grace period
    S->>S: Reactivate ECO to IN_PROGRESS
    F->>R: Complete repair
    S->>S: Mark COMPLETED again
    S->>S: Grace period expires with no change
```
This is a better model because it reflects the shape of work rather than forcing work to impersonate software neatness.
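The detection itself does not need anything exotic. A set difference between the previously observed task IDs and the current snapshot is enough to tell "same finished set" from "new work joined"; this sketch uses hypothetical names, and the real service reportedly tracks counts per subsector rather than full ID sets:

```go
package main

import "fmt"

// detectLateClone compares the task IDs seen on the previous poll with
// the current snapshot and reports any task that joined the subsector.
// This distinguishes "the same finished set" from "new work appeared".
// (Hypothetical sketch; the real polling service tracks task counts.)
func detectLateClone(previous, current []string) []string {
	seen := make(map[string]bool, len(previous))
	for _, id := range previous {
		seen[id] = true
	}
	var newTasks []string
	for _, id := range current {
		if !seen[id] {
			newTasks = append(newTasks, id)
		}
	}
	return newTasks
}

func main() {
	prev := []string{"task-investigate-1"}
	// Ten minutes later the technician clones a follow-on repair task.
	curr := []string{"task-investigate-1", "task-repair-2"}
	fmt.Println(detectLateClone(prev, curr)) // [task-repair-2]
}
```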
Reactivation should be legal, visible, and boring
A lot of bad systems reactivate awkwardly. They smuggle reopened work through hidden flags, comments, or ad hoc exceptions because the original state machine assumed completion was irreversible.
Strike does the healthier thing. It makes reactivation a valid status transition.
The domain invariants explicitly allow:
- OPEN -> IN_PROGRESS -> COMPLETED as the normal path
- COMPLETED -> IN_PROGRESS when a task is reopened or a new task appears during the grace period
That choice matters for three reasons.
First, it keeps the domain model honest. If reopened work is real, it belongs in the state machine.
Second, it improves debugging. You can reason about why the status changed because the system emits a reason like:
- all tasks in subsector completed
- new task detected during grace period
- task reopened during grace period
Third, it prevents people from inventing shadow status models in spreadsheets, chat threads, or tribal memory.
If your operators routinely say things like “ignore the green check, it’s actually still live,” the software has already lost the status-model argument.
A reactivation path is not a concession to messiness. It is how a disciplined system handles a world that changes after the first apparent ending.
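Making reactivation legal and visible can be as simple as a transition table that refuses anything not listed and attaches a reason to everything it accepts. A minimal sketch, assuming the invariants and reason strings quoted above (the function and map names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// legal encodes the documented invariants, including the deliberate
// COMPLETED -> IN_PROGRESS reactivation during the grace period.
var legal = map[string][]string{
	"OPEN":        {"IN_PROGRESS"},
	"IN_PROGRESS": {"COMPLETED"},
	"COMPLETED":   {"IN_PROGRESS"}, // reactivation is a first-class rule
}

// transition validates a status change and pairs it with an explicit,
// queryable reason. In a real system the reason would be persisted so
// that debugging is plain SQL, not tribal memory.
func transition(from, to, reason string) (string, error) {
	for _, allowed := range legal[from] {
		if allowed == to {
			return fmt.Sprintf("%s -> %s (%s)", from, to, reason), nil
		}
	}
	return "", errors.New("illegal transition: " + from + " -> " + to)
}

func main() {
	out, _ := transition("COMPLETED", "IN_PROGRESS", "new task detected during grace period")
	fmt.Println(out)
	if _, err := transition("COMPLETED", "OPEN", "no such rule"); err != nil {
		fmt.Println(err) // the table rejects anything it does not name
	}
}
```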
The real design pattern is not saga magic, it is durable observation
The polling package documentation says something I appreciate: this completion logic is a pragmatic compromise rather than a grand workflow-engine abstraction.
Instead of introducing a separate process manager framework for every nuance, Strike embeds the completion checker inside the polling service and directly updates the read model. The docs are candid that this is less pure than a fully event-choreographed design, but simpler for the current problem.
That tradeoff is worth noticing because teams often over-romanticize orchestration patterns.
The valuable pattern here is not “use a fancy saga framework.” The valuable pattern is:
- keep durable state,
- keep timers,
- tolerate retries and duplicates,
- observe for long enough to catch late truth,
- and make status transitions explicit.
That same philosophy shows up elsewhere in the system. The pager process manager uses database-backed process state and deadlines rather than ephemeral in-memory orchestration. The ECO completion checker uses a timer-shaped grace period rather than pretending the first completed snapshot is final.
Different mechanism, same instinct: durable observation beats brittle optimism. This is one place where official workflow-engine literature is helpful as supporting context. Systems like Temporal emphasize durable workflow execution and timers because long-running work rarely behaves like a single request-response function. Strike reaches a similar conclusion with plainer building blocks.
Idempotency is not optional when completion can be revisited
As soon as a system allows late events, reopened work, and repeated polling, it has to become serious about duplicate handling.
That is not theory. It is table stakes.
A polling cycle may see the same completed task set multiple times. A webhook or message bus may deliver an event more than once. A task list may be unchanged for several polls and then change in a way that looks suspiciously like an older state returning.
The response cannot be panic. It has to be idempotency.
Strike’s architecture already leans this way. The pager process manager handles duplicate events explicitly. The completion checker treats same-state updates as no-ops where appropriate. External workflow literature from Stripe and Hookdeck makes the same broader point for API and webhook systems: retries are normal, so correctness depends on making repeated inputs safe.
In other words, the grace period pattern is not just about timers. It is about accepting that the same workflow may be observed, updated, and re-observed many times before it is truly quiet.
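The no-op rule for repeated observations is the whole trick. A hypothetical sketch (the `store` type and write counter are illustrative, not Strike's read model) showing that delivering the same event five times produces exactly one write:

```go
package main

import "fmt"

// store stands in for the read model; writes counts real state changes
// so the demo can show that duplicates are absorbed.
// (Illustrative type, not the actual read-model code.)
type store struct {
	status string
	writes int
}

// applyObservation updates the stored status only when it actually
// changes, so the same polling snapshot or webhook can be delivered
// any number of times without side effects.
func (s *store) applyObservation(observed string) {
	if s.status == observed {
		return // duplicate observation: safe no-op
	}
	s.status = observed
	s.writes++
}

func main() {
	s := &store{status: "IN_PROGRESS"}
	for i := 0; i < 5; i++ {
		s.applyObservation("COMPLETED") // repeated delivery of the same event
	}
	fmt.Println(s.status, s.writes) // COMPLETED 1
}
```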
The operator experience depends on naming the states correctly
There is also a product lesson here. Internal truth and user-facing meaning are not always identical.
Strike already separates technical status from operator-friendly display status in other parts of the system. The same discipline applies here. If the software internally knows an ECO is in a grace-period completion window, the UX should avoid implying irreversible finality when the system itself is still watching for follow-on work.
That does not necessarily mean exposing a literal state named SOFT_COMPLETED. It does mean designing the operator experience so that people are not surprised when an apparently complete ECO becomes active again.
Useful patterns might include:
- a subtle “monitoring for follow-on work” indicator,
- timeline entries explaining reactivation reasons,
- and language that distinguishes “all current tasks completed” from “case closed permanently.”
The right UI copy can save a lot of unnecessary human confusion.
The bigger lesson is about truth in long-running systems
The completion grace period is a small feature with a large worldview behind it.
That worldview says:
- status is a model of reality, not reality itself,
- terminal states should be earned, not assumed,
- long-running workflows need timers as much as transitions,
- and software should leave room for the last honest correction.
This is especially important in operational domains where the cost of premature certainty is not just a weird record. It is the wrong team believing an outage is resolved, the wrong dashboard going quiet, or the wrong escalation path standing down too soon.
The best workflow systems are not the ones that declare victory fastest. They are the ones that wait just long enough to avoid being fooled.
Discussion prompts
- Where in our current systems do we still treat “no visible work right now” as equivalent to “the process is truly over,” and what would a grace-period model change?
- Should Strike expose the grace-period window more explicitly in operator UX, or is timeline-level visibility enough for now?
- Which other long-running workflows in our stack need a first-class reactivation path instead of treating reopening as an exception?
References
- Osprey Strike ECO workflow docs. Internal architecture and lifecycle writeup covering subsector polling, completion grace period, and late-clone reality. Strongest direct source for the domain behavior described here.
- Osprey Strike polling package docs. Internal note on the pragmatic design tradeoff: completion checking is embedded in polling and directly updates the read model for simplicity.
- Osprey Strike completion checker (
completion.go). The concrete implementation of completion, reactivation, and grace-period finalization logic. - Osprey Strike ECO invariants (
invariants.go). Shows thatCOMPLETED -> IN_PROGRESSis an intentional, valid business transition, not an accidental loophole. - Temporal documentation, Workflow Execution overview. Useful supporting context on durable long-running workflows and why timers and persisted execution state matter in real systems.
- Stripe documentation, idempotent requests. A canonical reference for retry-safe API design, relevant because repeated observations and duplicate delivery are normal in workflow systems.
- Hookdeck guide, webhook idempotency. Practical reference for handling duplicate deliveries and at-least-once event behavior in integration-heavy architectures.
- Microsoft Azure Architecture Center, Compensating Transaction pattern. Background reading on workflows that need explicit recovery and correction paths rather than naïve all-or-nothing assumptions.
Generated by Cairns · Agent-powered with Claude