Distributed systems people love clean terminal states. COMPLETED feels good. It closes dashboards, quiets alerts, and lets everyone move on.

Field operations are less polite.

A technician finishes the investigation task, then clones a repair task ten minutes later. A supervisor reopens a task because the original closure was premature. A polling loop catches a snapshot in which every visible task is complete while the reality on the ground is still changing. If the platform treats the first moment of apparent completion as final truth, it will lie to operators at exactly the moment they most need accurate status.

That is why Osprey Strike’s ECO workflow does something subtle and important. It allows an ECO to become COMPLETED, but only as a soft completion during a grace period. During that window, polling continues. New tasks can reactivate the ECO. Reopened tasks can reactivate the ECO. Only after the grace period expires without new activity does the system earn a hard stop.

This is one of those design choices that looks like implementation detail until you imagine the alternative. Then it becomes obvious.

“All tasks are completed” is an observation, not a verdict

The polling logic in Strike is refreshingly plain about what it knows. It checks the tasks associated with an ECO’s subsector, determines whether any tasks exist, and asks a narrow question: are they all currently completed?

That is useful, but it is not the same as asking whether the incident is over forever.

The internal completion checker makes the distinction explicit:

  • if all tasks are completed and the ECO is not already completed, mark it COMPLETED
  • if the ECO is completed and new tasks appear, move it back to IN_PROGRESS
  • if the ECO is completed and incomplete tasks reappear, move it back to IN_PROGRESS

That logic is small. The idea behind it is bigger. The domain invariants allow COMPLETED -> IN_PROGRESS specifically to support reactivation during the grace period. That is a strong sign that the behavior is a business rule, not an accidental side effect.
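Here is the shape of that checker in Go. This is a minimal sketch, not the real completion.go: the type names, the status constants, and the combined reason string are illustrative only.

package eco

// Status is shared by ECOs and tasks in this sketch; the real
// system may model task status separately.
type Status string

const (
    Open       Status = "OPEN"
    InProgress Status = "IN_PROGRESS"
    Completed  Status = "COMPLETED"
)

type Task struct{ Status Status }

type ECO struct {
    Status Status
    Reason string // why the last transition happened
}

// CheckCompletion applies the three rules above to one ECO.
func CheckCompletion(eco *ECO, tasks []Task) {
    allDone := len(tasks) > 0
    for _, t := range tasks {
        if t.Status != Completed {
            allDone = false
        }
    }

    switch {
    case allDone && eco.Status != Completed:
        eco.Status, eco.Reason = Completed, "all tasks in subsector completed"
    case !allDone && eco.Status == Completed:
        // Covers both reactivation triggers; the real checker
        // records whether a task was reopened or newly created.
        eco.Status, eco.Reason = InProgress, "incomplete task detected during grace period"
    }
}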

Key Takeaway

A field-work system should treat completion as a claim under observation, not a permanent fact discovered once.

If you collapse those ideas, you get false confidence. The UI says complete. The NOC relaxes. Then a follow-on task appears and everyone has to mentally unwind what the software prematurely declared settled.

Soft-complete is the honest state when the world may still move

The grace period exists because the workflow has inertia. Work in the field does not materialize as one neat atomic transaction. Investigation, follow-on work, reassignment, and verification often arrive in bursts.

Strike’s Render polling model keeps polling COMPLETED ECOs for a configurable period, twenty-four hours by default in the documented workflow. During that time, the ECO is done enough to communicate progress, but not done enough to abandon observation.

That gives the system a three-part progression instead of a brittle binary:

  1. Active work: OPEN or IN_PROGRESS, poll continuously.
  2. Soft-complete: COMPLETED, but still within grace period, keep polling.
  3. Hard-complete: grace period expired with no new activity, stop polling.

A simple timeline looks like this:

flowchart LR
  A[Tasks active] --> B[All visible tasks completed]
  B --> C[Mark ECO COMPLETED]
  C --> D[Grace period continues polling]
  D -->|New task appears| E[Reactivate to IN_PROGRESS]
  D -->|Task reopened| E
  D -->|No activity until timer expires| F[Hard complete, stop polling]

That middle state is doing the real work. It is where the system admits uncertainty without becoming useless.
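That progression also reads naturally as a polling predicate. A minimal sketch, reusing the Status values from the earlier sketch and assuming the soft-completion time is recorded; the constant mirrors the documented twenty-four-hour default, and the real window is configurable:

package eco

import "time"

// DefaultGracePeriod mirrors the documented default; the real
// value is configurable per deployment.
const DefaultGracePeriod = 24 * time.Hour

// ShouldPoll is the three-part progression as a single predicate.
func ShouldPoll(status Status, completedAt, now time.Time) bool {
    switch status {
    case Open, InProgress:
        return true // active work: poll continuously
    case Completed:
        // Soft-complete: keep polling inside the grace period.
        // Past it, the ECO is hard-complete and polling stops.
        return now.Sub(completedAt) < DefaultGracePeriod
    }
    return false
}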

Late clones are not edge cases, they are workflow reality

One of the most practical details in the internal docs is late-clone detection. The polling service tracks task counts and can distinguish between “the same finished set of tasks” and “a new task has now joined the subsector.”
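The count tracking is small enough to sketch. This assumes the polling service persists the last observed task count per subsector; the names are hypothetical, and a real implementation would compare task IDs as well as counts:

package eco

// PollState remembers what the previous poll saw for one subsector.
type PollState struct {
    KnownTaskCount int
}

// DetectNewTask reports whether a task has joined the subsector
// since the last poll, the signature of a late clone.
func DetectNewTask(state *PollState, tasks []Task) bool {
    arrived := len(tasks) > state.KnownTaskCount
    state.KnownTaskCount = len(tasks)
    return arrived
}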

That matters because a field technician may finish an investigation task before creating the follow-on repair task. Operationally, that is not bizarre. It is often exactly how the work happens. The investigation establishes what is wrong. Only then does the next piece of work become concrete enough to create.

If the platform used a naive rule like “the first moment everything is completed is final completion,” that ordinary workflow would look like a bug. The software would oscillate between certainty and embarrassment.

Instead, Strike makes room for the real sequence:

sequenceDiagram
  participant N as NOC
  participant S as Strike
  participant R as Render
  participant F as Field tech

  N->>S: Create ECO
  S->>R: Create investigation task
  F->>R: Complete investigation
  S->>S: Detect all current tasks complete
  S->>S: Mark COMPLETED, keep polling in grace period
  F->>R: Clone follow-on repair task
  S->>S: Detect new task during grace period
  S->>S: Reactivate ECO to IN_PROGRESS
  F->>R: Complete repair
  S->>S: Mark COMPLETED again
  S->>S: Grace period expires with no change

This is a better model because it reflects the shape of work rather than forcing work to impersonate software neatness.

A lot of bad systems reactivate awkwardly. They smuggle reopened work through hidden flags, comments, or ad hoc exceptions because the original state machine assumed completion was irreversible.

Strike does the healthier thing. It makes reactivation a valid status transition.

The domain invariants explicitly allow:

  • OPEN -> IN_PROGRESS -> COMPLETED as the normal path
  • COMPLETED -> IN_PROGRESS when a task is reopened or a new task appears during grace period
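Encoded as data, those invariants are tiny. A sketch in the spirit of invariants.go, though the actual representation there may differ:

package eco

// allowedTransitions encodes the business rules as data.
var allowedTransitions = map[Status][]Status{
    Open:       {InProgress},
    InProgress: {Completed},
    // The deliberate loop: reactivation during the grace period.
    Completed: {InProgress},
}

// CanTransition reports whether from -> to is a valid business move.
func CanTransition(from, to Status) bool {
    for _, next := range allowedTransitions[from] {
        if next == to {
            return true
        }
    }
    return false
}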

That choice matters for three reasons.

First, it keeps the domain model honest. If reopened work is real, it belongs in the state machine.

Second, it improves debugging. You can reason about why the status changed because the system emits a reason like:

  • all tasks in subsector completed
  • new task detected during grace period
  • task reopened during grace period

Third, it prevents people from inventing shadow status models in spreadsheets, chat threads, or tribal memory.
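Those reasons deserve to be first-class values rather than free text. A small sketch, with the record shape hypothetical:

package eco

import "time"

// Transition reasons, paraphrased from the documented list.
const (
    ReasonAllTasksCompleted = "all tasks in subsector completed"
    ReasonNewTaskInGrace    = "new task detected during grace period"
    ReasonTaskReopened      = "task reopened during grace period"
)

// StatusChange is an audit record emitted on every transition,
// so operators can answer "why did this change?" later.
type StatusChange struct {
    From, To Status
    Reason   string
    At       time.Time
}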

Tip

If your operators routinely say things like “ignore the green check, it’s actually still live,” the software has already lost the status-model argument.

A reactivation path is not a concession to messiness. It is how a disciplined system handles a world that changes after the first apparent ending.

The real design pattern is not saga magic, it is durable observation

The polling package documentation says something I appreciate: this completion logic is a pragmatic compromise rather than a grand workflow-engine abstraction.

Instead of introducing a separate process manager framework for every nuance, Strike embeds the completion checker inside the polling service and directly updates the read model. The docs are candid that this is less pure than a fully event-choreographed design, but simpler for the current problem.

That tradeoff is worth noticing because teams often over-romanticize orchestration patterns.

The valuable pattern here is not “use a fancy saga framework.” The valuable pattern, sketched in code after this list, is:

  • keep durable state,
  • keep timers,
  • tolerate retries and duplicates,
  • observe for long enough to catch late truth,
  • and make status transitions explicit.
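The last two items reduce to one durable sweep. A sketch assuming a relational store; the table and column names are hypothetical, and the point is persisted state plus a timer, not this particular schema:

package eco

import (
    "context"
    "database/sql"
    "time"
)

// FinalizeQuiet hard-completes ECOs whose grace period has expired
// with no new activity. Safe to run repeatedly from a scheduler.
func FinalizeQuiet(ctx context.Context, db *sql.DB, grace time.Duration) error {
    _, err := db.ExecContext(ctx, `
        UPDATE ecos
           SET hard_completed = TRUE
         WHERE status = 'COMPLETED'
           AND hard_completed = FALSE
           AND completed_at < $1`,
        time.Now().Add(-grace))
    return err
}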

That same philosophy shows up elsewhere in the system. The pager process manager uses database-backed process state and deadlines rather than ephemeral in-memory orchestration. The ECO completion checker uses a timer-shaped grace period rather than pretending the first completed snapshot is final.

Different mechanism, same instinct: durable observation beats brittle optimism. This is one place where official workflow-engine literature is helpful as supporting context. Systems like Temporal emphasize durable workflow execution and timers because long-running work rarely behaves like a single request-response function. Strike reaches a similar conclusion with plainer building blocks.

Idempotency is not optional when completion can be revisited

As soon as a system allows late events, reopened work, and repeated polling, it has to become serious about duplicate handling.

That is not theory. It is table stakes.

A polling cycle may see the same completed task set multiple times. A webhook or message bus may deliver an event more than once. A task list may be unchanged for several polls and then change in a way that looks suspiciously like an older state returning.

The response cannot be panic. It has to be idempotency.

Strike’s architecture already leans this way. The pager process manager handles duplicate events explicitly. The completion checker treats same-state updates as no-ops where appropriate. External workflow literature from Stripe and Hookdeck makes the same broader point for API and webhook systems: retries are normal, so correctness depends on making repeated inputs safe.

In other words, the grace period pattern is not just about timers. It is about accepting that the same workflow may be observed, updated, and re-observed many times before it is truly quiet.
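In code, that means every status update must be safe to apply twice. A sketch, building on the transition check from earlier; the function name and shape are illustrative:

package eco

// ApplyStatus is idempotent: observing the same state twice is a
// no-op, so repeated polls and duplicate deliveries are harmless.
func ApplyStatus(eco *ECO, next Status, reason string) (changed bool) {
    if eco.Status == next {
        return false // duplicate observation: nothing to do
    }
    if !CanTransition(eco.Status, next) {
        return false // invalid move: ignore rather than corrupt state
    }
    eco.Status, eco.Reason = next, reason
    return true
}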

The operator experience depends on naming the states correctly

There is also a product lesson here. Internal truth and user-facing meaning are not always identical.

Strike already separates technical status from operator-friendly display status in other parts of the system. The same discipline applies here. If the software internally knows an ECO is in a grace-period completion window, the UX should avoid implying irreversible finality when the system itself is still watching for follow-on work.

That does not necessarily mean exposing a literal state named SOFT_COMPLETED. It does mean designing the operator experience so that people are not surprised when an apparently complete ECO becomes active again.

Useful patterns might include:

  • a subtle “monitoring for follow-on work” indicator,
  • timeline entries explaining reactivation reasons,
  • and language that distinguishes “all current tasks completed” from “case closed permanently.”

The right UI copy can save a lot of unnecessary human confusion.

The bigger lesson is about truth in long-running systems

The completion grace period is a small feature with a large worldview behind it.

That worldview says:

  • status is a model of reality, not reality itself,
  • terminal states should be earned, not assumed,
  • long-running workflows need timers as much as transitions,
  • and software should leave room for the last honest correction.

This is especially important in operational domains where the cost of premature certainty is not just a weird record. It is the wrong team believing an outage is resolved, the wrong dashboard going quiet, or the wrong escalation path standing down too soon.

The best workflow systems are not the ones that declare victory fastest. They are the ones that wait just long enough to avoid being fooled.

Discussion prompts

  1. Where in our current systems do we still treat “no visible work right now” as equivalent to “the process is truly over,” and what would a grace-period model change?
  2. Should Strike expose the grace-period window more explicitly in operator UX, or is timeline-level visibility enough for now?
  3. Which other long-running workflows in our stack need a first-class reactivation path instead of treating reopening as an exception?

References

  1. Osprey Strike ECO workflow docs. Internal architecture and lifecycle writeup covering subsector polling, completion grace period, and late-clone reality. Strongest direct source for the domain behavior described here.
  2. Osprey Strike polling package docs. Internal note on the pragmatic design tradeoff: completion checking is embedded in polling and directly updates the read model for simplicity.
  3. Osprey Strike completion checker (completion.go). The concrete implementation of completion, reactivation, and grace-period finalization logic.
  4. Osprey Strike ECO invariants (invariants.go). Shows that COMPLETED -> IN_PROGRESS is an intentional, valid business transition, not an accidental loophole.
  5. Temporal documentation, Workflow Execution overview. Useful supporting context on durable long-running workflows and why timers and persisted execution state matter in real systems.
  6. Stripe documentation, idempotent requests. A canonical reference for retry-safe API design, relevant because repeated observations and duplicate delivery are normal in workflow systems.
  7. Hookdeck guide, webhook idempotency. Practical reference for handling duplicate deliveries and at-least-once event behavior in integration-heavy architectures.
  8. Microsoft Azure Architecture Center, Compensating Transaction pattern. Background reading on workflows that need explicit recovery and correction paths rather than naïve all-or-nothing assumptions.