An ECO Is Born

Everything starts with a NOC operator and a geometry. Someone detects an outage — maybe an alarm fires, maybe a customer calls, maybe a monitoring system flags a loss-of-light condition — and the operator creates an Emergency Callout Order in Strike’s web UI.

The creation form captures surprisingly little data, and that’s deliberate. In an emergency, you want the fastest path from “something is broken” to “someone is on their way.” The ECO captures:

  • Location — a GeoJSON Point or LineString stored as PostGIS geometry. This is the incident site, and it drives everything that follows: which contractor gets paged, which Render project receives the task, which pager list applies.
  • Job type — CMR (Claims Management Resource), MFR (Maintenance Field Request), or FNOL (First Notice of Loss). These categories determine downstream billing and reporting.
  • Description — free text from the operator describing what’s known about the outage.
  • Tenant context — the NOC tenant creating the ECO and the OSP tenant who will work it. Every ECO exists at the intersection of these two entities.
Definition

ECO as Aggregate Root: In event-sourced terms, the ECO is an aggregate — a consistency boundary that accepts commands and emits events. The CreateECOCommand is validated by the aggregate (geometry must be valid, job type must be recognized, description can’t be empty), and if it passes, a single ECOCreatedEvent is appended to the event store. That event is the ECO’s birth certificate — immutable, timestamped, and carrying the full initial state.
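The command-validation path can be sketched in a few lines of Go. Field names, the GeoJSON representation, and the exact error messages here are illustrative assumptions, not Strike's actual schema — only the rule set (valid geometry, recognized job type, non-empty description) comes from the text:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// CreateECOCommand mirrors the command described in the text; field
// names are assumptions for illustration.
type CreateECOCommand struct {
	GeometryGeoJSON string // GeoJSON Point or LineString
	JobType         string // CMR, MFR, or FNOL
	Description     string
}

// ECOCreatedEvent is the ECO's "birth certificate": immutable,
// timestamped, carrying the full initial state.
type ECOCreatedEvent struct {
	CreatedAt   time.Time
	JobType     string
	Description string
	Geometry    string
}

var validJobTypes = map[string]bool{"CMR": true, "MFR": true, "FNOL": true}

// handleCreate validates the command and, on success, emits the single
// creation event. Real geometry validation would go through PostGIS;
// this sketch only checks the field is present.
func handleCreate(cmd CreateECOCommand) (*ECOCreatedEvent, error) {
	if cmd.GeometryGeoJSON == "" {
		return nil, errors.New("geometry must be valid")
	}
	if !validJobTypes[cmd.JobType] {
		return nil, fmt.Errorf("unrecognized job type %q", cmd.JobType)
	}
	if cmd.Description == "" {
		return nil, errors.New("description can't be empty")
	}
	return &ECOCreatedEvent{
		CreatedAt:   time.Now().UTC(),
		JobType:     cmd.JobType,
		Description: cmd.Description,
		Geometry:    cmd.GeometryGeoJSON,
	}, nil
}

func main() {
	_, err := handleCreate(CreateECOCommand{JobType: "CMR"})
	fmt.Println(err) // rejected: geometry must be valid
}
```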

The moment that event hits the outbox, the system wakes up. The transactional outbox publisher — polling every 100ms — picks up the event and fans it out to three subscribers simultaneously: the ECO projection (which creates the read model for the UI), the WebSocket publisher (which notifies connected browsers), and the Render integration processor (which starts the external integration chain).
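The fan-out shape can be sketched as one iteration of the publisher loop. This is a minimal in-memory model — the real publisher reads rows from an outbox table inside the event-store transaction boundary, and the subscriber names are the three consumers named above:

```go
package main

import "fmt"

// outboxRow stands in for a row in the transactional outbox table.
type outboxRow struct {
	ID      int
	Payload string
}

// subscriber is one of the three consumers described in the text.
type subscriber func(outboxRow)

// drainOnce is a single iteration of the publisher loop (the real
// loop ticks every 100ms). Every pending row is delivered to every
// subscriber.
func drainOnce(pending []outboxRow, subs []subscriber) {
	for _, row := range pending {
		for _, s := range subs {
			s(row)
		}
	}
}

func main() {
	delivered := map[string]int{}
	subs := []subscriber{
		func(r outboxRow) { delivered["projection"]++ }, // read model for the UI
		func(r outboxRow) { delivered["websocket"]++ },  // notify connected browsers
		func(r outboxRow) { delivered["render"]++ },     // external integration chain
	}
	drainOnce([]outboxRow{{ID: 1, Payload: "ECOCreatedEvent"}}, subs)
	fmt.Println(delivered)
}
```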

The ECO’s initial status is OPEN. Its render integration status is PENDING. Neither will stay that way for long.

Reaching the Field: Render Integration

The Render integration processor is the first process manager to act on a new ECO. It’s a single-shot workflow: receive the ECOCreatedEvent, create a task in the OSP’s Render Networks tenant, and report success or failure. Render Networks is the field workforce management platform that OSP contractors use to manage their crews. Strike doesn’t replace Render — it creates tasks in Render and polls for status changes. Field technicians never interact with Strike directly.

The processor looks up the OSP tenant’s Render instance from the database — each OSP has its own API credentials, project name, and default task type. It creates a ClientFactory that handles OAuth2 token management with singleflight to prevent token stampedes when multiple ECOs are being processed simultaneously.

The task creation request carries the ECO’s geometry, description, and a critical piece of metadata: the subsector name. This is the ECO’s identifier in Render’s world — formatted as ECO-{job_id} (e.g., ECO-E2-00524-2025). Every task created for this ECO, whether the initial investigation or follow-on work created by field technicians, shares this subsector. It’s the grouping mechanism that lets Strike track an entire ECO’s work as a unit.
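The naming convention and the grouping it enables are small enough to show directly. The ECO-{job_id} format comes from the text; the Task type and filtering helper are illustrative assumptions:

```go
package main

import "fmt"

// SubsectorName builds the Render-side identifier for an ECO, using
// the ECO-{job_id} convention described in the text.
func SubsectorName(jobID string) string {
	return "ECO-" + jobID
}

// Task is a minimal stand-in for a Render task.
type Task struct {
	ID        int
	Subsector string
}

// TasksForECO filters a task list down to the ECO's subsector — the
// grouping mechanism that lets Strike track an entire ECO's work,
// including follow-on tasks created in the field, as one unit.
func TasksForECO(all []Task, jobID string) []Task {
	var out []Task
	name := SubsectorName(jobID)
	for _, t := range all {
		if t.Subsector == name {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(SubsectorName("E2-00524-2025")) // ECO-E2-00524-2025
}
```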

```mermaid
sequenceDiagram
    participant ES as Event Store
    participant RI as Render Integration<br/>Processor
    participant CF as Client Factory
    participant RA as Render API

    ES->>RI: ECOCreatedEvent
    RI->>RI: Look up OSP's Render instance
    RI->>CF: GetClient(instanceID)
    CF->>CF: OAuth2 token (singleflight)
    CF-->>RI: Authenticated client
    RI->>RA: POST /tasks (subsector, geometry, description)
    RA-->>RI: 201 Created (projectTaskID)
    RI->>ES: UpdateRenderStatusCommand(TASKED)
```
If the Render API call succeeds, the processor emits an UpdateECORenderStatusCommand with status TASKED, carrying the project task ID back from Render. If it fails — network error, authentication failure, missing Render instance — the status becomes FAILED with an error message, and the NOC sees an error state prompting escalation.

Key Takeaway

The render integration status (PENDING → TASKED → DISPATCHED → FAILED) is an internal tracking status, not shown to NOC operators. It exists for system health monitoring and debugging. On success, tasks just appear. On failure, operators see a simplified error state.

Pager Dispatch: Getting Humans Moving

Creating a task in Render puts work into the system. But tasks don’t move themselves — someone needs to answer the phone. The pager process manager is the orchestrator that turns a TASKED ECO into an acknowledged field dispatch.

When the ECO’s render integration status changes to TASKED, the pager processor wakes up. Its first job is geographic: using the ECO’s PostGIS geometry, it resolves the pager region — a polygon that maps to a contact list of OSP supervisors responsible for that area. Each OSP tenant maintains its own regional pager lists, so the same physical location can route to different people depending on which contractor is assigned.

The processor builds a contact list, generates a message ("NOC: ECO 524 - Fiber cut on Main St"), and fires a StartPagerRunCommand to create the PagerRun aggregate. Then it calls the Twilio API to initiate the first phone call.

```mermaid
stateDiagram-v2
    [*] --> PENDING: Process state created
    PENDING --> RUNNING: Twilio call initiated
    RUNNING --> RUNNING: Deadline extended on activity
    RUNNING --> COMPLETED: Accepted / Exhausted / Failed
    COMPLETED --> [*]

    note right of RUNNING
        Background worker checks
        deadline every 5 seconds.
        Timeout triggers failure.
    end note
```

Here’s where the design gets interesting. The pager process manager doesn’t hold state in memory. It writes a pager_process_state row to PostgreSQL with a deadline_at timestamp, then returns. A background worker polls every 5 seconds for overdue deadlines. When a Twilio webhook arrives (the contractor answered, declined, or didn’t pick up), the webhook handler updates the process state and extends the deadline.

This is a deliberate architectural choice. Traditional saga frameworks keep process state in memory or in framework-managed storage. Strike uses plain PostgreSQL rows because when paging fails at 3 AM, you want SELECT * FROM pager_process_state WHERE eco_id = '...', not framework log archaeology. The debuggability wins justified the extra implementation work.

The full pager dispatch sequence looks like this:

```mermaid
sequenceDiagram
    participant PP as Pager Processor
    participant DB as pager_process_state
    participant TW as Twilio
    participant SUP as OSP Supervisor
    participant BW as Background Worker

    PP->>DB: INSERT state (PENDING, deadline=now+30s)
    PP->>TW: POST /calls (contact #1)
    TW->>SUP: Phone call

    alt Supervisor accepts
        SUP->>TW: Keypress accept
        TW->>PP: Webhook: ACCEPTED
        PP->>DB: UPDATE status=COMPLETED, outcome=DISPATCHED
    else No answer (timeout)
        BW->>DB: SELECT WHERE deadline < NOW()
        BW->>PP: Timeout detected
        PP->>TW: POST /calls (contact #2)
        PP->>DB: UPDATE deadline=now+30s
    else All contacts exhausted
        PP->>DB: UPDATE status=COMPLETED, outcome=EXHAUSTED
    end
```

Each webhook event — CALLING, ACCEPTED, DECLINED, FAILED — is recorded in the PagerRun aggregate’s timeline, building a complete record of every attempt. The process state uses optimistic locking (a version column) to handle the race between webhook delivery and the background timeout worker. If both try to update simultaneously, one gets a version conflict and retries.
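The version-column race can be shown with a minimal in-memory analogue. The store type here simulates the pager_process_state table; in SQL terms each update is an UPDATE ... WHERE version = $expected that conflicts when zero rows match:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrVersionConflict signals the loser of the race: reload the row
// and retry against the new version.
var ErrVersionConflict = errors.New("version conflict: reload and retry")

// store simulates the pager_process_state table with a version column.
type store struct {
	mu      sync.Mutex
	status  string
	version int
}

// update is the in-memory analogue of
//   UPDATE pager_process_state SET status=$1, version=version+1
//   WHERE eco_id=$2 AND version=$3
// returning a conflict when the version no longer matches.
func (s *store) update(newStatus string, expectedVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.version != expectedVersion {
		return ErrVersionConflict
	}
	s.status = newStatus
	s.version++
	return nil
}

func main() {
	s := &store{status: "RUNNING", version: 3}
	// The webhook handler and the timeout worker both read version 3...
	fmt.Println(s.update("COMPLETED", 3)) // webhook wins
	fmt.Println(s.update("RUNNING", 3))   // worker loses and must retry
}
```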

Warning

The pager processor handles duplicate events explicitly. Watermill may deliver the same event more than once, so the processor checks for an existing pager run before starting a new one. Idempotency isn’t optional — it’s a correctness requirement.
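The existence check can be sketched in a few lines. Here a map stands in for whatever uniqueness check the real processor performs against the PagerRun aggregate (the runStore type is an assumption for illustration):

```go
package main

import "fmt"

// runStore tracks which ECOs already have a pager run; in production
// this is a lookup against existing PagerRun aggregates, here a map.
type runStore map[string]bool

// startPagerRun is idempotent: a redelivered event finds the existing
// run and returns false instead of paging the same supervisors twice.
func startPagerRun(runs runStore, ecoID string) (started bool) {
	if runs[ecoID] {
		return false
	}
	runs[ecoID] = true
	return true
}

func main() {
	runs := runStore{}
	fmt.Println(startPagerRun(runs, "eco-524")) // true: first delivery
	fmt.Println(startPagerRun(runs, "eco-524")) // false: duplicate delivery, no-op
}
```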

When the pager run completes with a DISPATCHED outcome, the dispatch processor updates the ECO’s render integration status to DISPATCHED. The NOC operator, watching the live status, sees the ECO transition from “page sent” to “contractor dispatched.” They don’t see the intermediate TASKED status — that’s internal plumbing.

Watching the Work: Render Polling

Once a contractor is dispatched, Strike’s job shifts from orchestration to observation. The Render polling service runs a background loop every 60 seconds, querying the Render API for all ECOs with active integrations.

The polling service queries eco_views for ECOs in a pollable state — dispatched but not yet hard-completed. For each ECO, it fetches the tasks in the ECO’s subsector from the Render API, using the same ClientFactory and per-tenant credentials as the integration processor.

Change detection uses fingerprinting. Each task’s relevant fields are hashed into an MD5 fingerprint. The service caches fingerprints in memory (sync.Map) and only publishes change events when a fingerprint differs from the cached version. This means the service can poll hundreds of ECOs every minute without generating noise — only actual field activity produces events.
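The mechanism looks roughly like this. Which task fields feed the hash is an assumption — only the MD5-over-relevant-fields approach and the sync.Map cache come from the text:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sync"
)

// task holds the fields that matter for change detection (which
// fields Strike actually hashes is an assumption here).
type task struct {
	ID     string
	Status string
	Notes  string
}

// fingerprints caches task ID -> last seen fingerprint in memory.
var fingerprints sync.Map

// fingerprint hashes the relevant fields into an MD5 digest.
func fingerprint(t task) string {
	sum := md5.Sum([]byte(t.ID + "|" + t.Status + "|" + t.Notes))
	return hex.EncodeToString(sum[:])
}

// changed reports whether the task differs from the cached version and
// updates the cache, so only actual field activity produces events.
// (The real service adds per-ECO locking around this read-modify-write.)
func changed(t task) bool {
	fp := fingerprint(t)
	prev, ok := fingerprints.Load(t.ID)
	fingerprints.Store(t.ID, fp)
	return !ok || prev.(string) != fp
}

func main() {
	t := task{ID: "T1", Status: "allocated"}
	fmt.Println(changed(t)) // true: first sighting
	fmt.Println(changed(t)) // false: no change, no event
	t.Status = "released"
	fmt.Println(changed(t)) // true: field activity detected
}
```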

```mermaid
flowchart TD
    Start[Poll cycle starts] --> Query[Query ECOs with active polling]
    Query --> Loop{For each ECO}
    Loop --> Fetch[Fetch tasks from Render API<br/>by subsector name]
    Fetch --> Compare{Fingerprint<br/>changed?}
    Compare -->|No| Skip[Skip — no changes]
    Compare -->|Yes| Publish[Publish task change event]
    Publish --> Check{All tasks<br/>completed?}
    Check -->|No| Loop
    Check -->|Yes| Grace[Start completion grace period]
    Skip --> Loop
    Loop -->|Done| Wait[Wait 60 seconds]
    Wait --> Start
    Grace --> Loop
```

The polling service also detects late clones — when a field technician creates a new task from a completed investigation. It tracks task counts per ECO and publishes a late-clone event when the count increases, which can revert an ECO from COMPLETED back to IN_PROGRESS. Late clones are a real operational pattern. A field tech completes the investigation, reports the damage, and then creates follow-on tasks for the actual repair work. The investigation might complete before the follow-on tasks even exist. Without the grace period and late-clone detection, Strike would prematurely close ECOs.
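The count-tracking logic is simple enough to sketch directly (function and variable names are illustrative; the increase-means-clone rule comes from the text):

```go
package main

import "fmt"

// taskCounts remembers how many tasks each ECO's subsector had at the
// previous poll (in-memory, like the fingerprint cache).
var taskCounts = map[string]int{}

// lateCloneDetected reports whether the task count for an ECO grew
// since the last poll — a field tech created follow-on work. The
// first poll just establishes a baseline.
func lateCloneDetected(ecoID string, current int) bool {
	prev, seen := taskCounts[ecoID]
	taskCounts[ecoID] = current
	return seen && current > prev
}

func main() {
	fmt.Println(lateCloneDetected("eco-524", 1)) // false: baseline poll
	fmt.Println(lateCloneDetected("eco-524", 1)) // false: unchanged
	fmt.Println(lateCloneDetected("eco-524", 2)) // true: late clone -> revert to IN_PROGRESS
}
```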

The service includes resilience features — exponential backoff on database errors, per-ECO locking to prevent concurrent read-modify-write races on fingerprint caches, and sync.Once guards for first-poll initialization. These aren’t premature optimization — they’re responses to production patterns where a single Render API hiccup shouldn’t cascade into missed updates for every ECO in the system.

Three Layers of Status

Strike maintains three distinct status models, and understanding why is essential to understanding the system’s design philosophy.

Layer 1: Event-sourced state (the truth). The ECO aggregate’s status — OPEN, IN_PROGRESS, COMPLETED — is derived by replaying its event stream. This is the authoritative record. When a StatusUpdatedEvent is applied, the aggregate validates the transition (you can’t go from COMPLETED back to OPEN without explicit intervention) and updates its internal state.

Layer 2: Render task status (external system state). Each task in Render has its own lifecycle: blueprinted → allocated → releasable → released → completed. These statuses reflect Render’s internal workflow — contractor assignment, crew dispatch, work execution. Strike polls these but doesn’t display them directly.

Layer 3: NOC display status (human-friendly view). This is what operators actually see. The display status maps Render’s six-state task lifecycle down to five human-meaningful states:

| Display Status | Meaning | Maps From |
| --- | --- | --- |
| Pending | Work identified, not yet assigned | blueprinted, pending, tasked |
| Assigned | Allocated to a tech or crew | allocated, releasable, released |
| In Progress | Active work happening | released (after assignment) |
| Blocked | Problem preventing progress | jeopardy |
| Complete | Work finished | completed |
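The mapping in the table above could be expressed as a simple switch. This is a sketch of the mapping only — in the real system "In Progress" also depends on assignment state, which a pure status-to-status function can't capture, so released maps to Assigned here:

```go
package main

import "fmt"

// displayStatus maps Render's internal task statuses to the
// NOC-facing states from the table above. "In Progress" additionally
// requires an assignment in the real system, so this sketch folds
// released into Assigned.
func displayStatus(renderStatus string) string {
	switch renderStatus {
	case "blueprinted", "pending", "tasked":
		return "Pending"
	case "allocated", "releasable", "released":
		return "Assigned"
	case "jeopardy":
		return "Blocked"
	case "completed":
		return "Complete"
	default:
		return "Pending" // conservative fallback for unknown statuses
	}
}

func main() {
	fmt.Println(displayStatus("releasable")) // Assigned
	fmt.Println(displayStatus("jeopardy"))   // Blocked
}
```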
Key Takeaway

The three-layer status model exists because operators and systems have different needs. NOC operators care about “is someone working on this?” — not whether Render’s internal state is allocated vs. releasable. By separating internal tracking from display, Strike can change Render integration details without touching the NOC UI, and vice versa.

The render integration status — PENDING, TASKED, DISPATCHED, FAILED — lives alongside these as a fourth, internal-only status. It’s a backend gate confirming the Render handoff succeeded. Operators never see it. When integration succeeds, tasks simply appear. When it fails, they see a simplified error prompting escalation.

This separation costs something: there are now multiple status fields to maintain, mapping logic at the API boundary, and documentation to keep synchronized. The payoff is that each audience gets exactly the information they need, training new NOC staff doesn’t require explaining Render’s internals, and the system is decoupled from Render’s implementation details.

The Mock Server: A Simulation Engine

You can’t develop an emergency response system by waiting for emergencies. The mock server — a Go application in packages/mock/ — simulates both the Render Networks API and Twilio’s call infrastructure, providing a complete end-to-end development environment.

The mock server exposes three surfaces: a Render API mock at /render/* that simulates task creation and status progression, a Twilio API mock at /twilio/* that receives call requests and sends webhook callbacks, and a control dashboard at /dashboard for selecting scenarios and watching execution in real time.

The scenario system is preset-based. You select a scenario, and the mock walks through the entire integration sequence with realistic timing:

| Scenario | Render Behavior | Pager Behavior |
| --- | --- | --- |
| happy-path | Normal integration (3s) | First contact accepts (2s) |
| render-fails | Returns error | No pager run started |
| pager-declined | Normal | First contact declines |
| pager-exhausted | Normal | All contacts decline |
| pager-timeout | Normal | No response (triggers timeout) |
| slow-render | 10s delays | Normal |
| second-accepts | Normal | First declines, second accepts |

Each scenario runs in its own goroutine, keyed by ECO ID. The orchestration engine manages state transitions independently per ECO — you can have one ECO running happy-path while another runs pager-exhausted, simultaneously. State is in-memory (acceptable for a dev tool), protected by mutex, and visible through the dashboard’s polling UI. The mock server replaced a Python script for Render simulation and a Kotlin mock class for Twilio. Unifying them into a single Go service enabled cross-integration scenario orchestration — something that wasn’t possible when the mocks lived in separate languages and processes.

Docker Compose profiles make this practical for daily development. The principle is “start what you’re not touching”: if you’re working on the API, running just docker for-api starts PostgreSQL and the mock server; if you’re working on the frontend, just docker for-web starts everything except the frontend. The mock server’s port (8090) matches the old Python mock, so the API’s environment variables just work.

Tip

The mock’s embedded dashboard uses Tailwind CSS via CDN and Alpine.js — no build step, no npm, no bundler. It’s a dev tool that refuses to become a second frontend project. The dashboard polls /control/status every second for live updates, which is fine for a tool that runs on localhost.

Completion and the Audit Trail

An ECO completes when all tasks in its subsector reach completed status in Render. But “complete” isn’t as simple as it sounds — the system needs to account for late-arriving work.

When the polling service detects that every task in the subsector is done, it marks the ECO as COMPLETED and starts a grace period (default: 24 hours). During this period, polling continues. If a new task appears — a field technician cloned the investigation to create follow-on work — the ECO reverts to IN_PROGRESS and the NOC is notified. If a completed task is reopened, same thing.

After the grace period expires without new activity, the ECO reaches hard completion. Polling stops permanently. The ECO is finalized. Any changes after this point require manual intervention — the system won’t automatically reactivate a hard-completed ECO.

```mermaid
stateDiagram-v2
    [*] --> OPEN: ECO created
    OPEN --> IN_PROGRESS: Field tech dispatched
    IN_PROGRESS --> COMPLETED: All tasks done
    COMPLETED --> IN_PROGRESS: Late clone or task reopened
    COMPLETED --> HARD_COMPLETE: Grace period expires (24h)
    HARD_COMPLETE --> [*]

    note right of COMPLETED
        Polling continues
        during grace period
    end note
```

What remains after completion is the event stream — and this is where event sourcing pays its largest dividend. Every ECO carries a complete, immutable record of everything that happened:

  1. ECOCreatedEvent — who created it, when, with what geometry and description
  2. RenderIntegrationStatusUpdatedEvent — when the Render task was created, which instance, the project task ID
  3. PagerRun events — every call attempt, every response, the timeline of human acknowledgment
  4. StatusUpdatedEvent — each transition, with timestamps and actor IDs
  5. Task change events — every field update detected by polling, including late clones

This isn’t a log file that rotates away or a database column that gets overwritten. It’s the data model itself. When a regulatory question arises — “who was notified about this outage, and how quickly did the response happen?” — the answer is a query against the event store, not a forensic reconstruction from scattered logs. The audit trail matters beyond compliance. Post-incident reviews can replay the exact sequence of events to identify bottlenecks — was the delay in pager acknowledgment? In field dispatch? In the completion grace period? The event stream is the single source of truth for “what actually happened.”

Key Takeaway

The event stream as audit trail isn’t a feature bolted onto the system — it’s a consequence of the architecture. When you store state as events, the audit trail is free. When you store state as mutable rows, the audit trail is an additional system you have to build, maintain, and trust.

Putting It All Together

The full ECO lifecycle is a sequence of handoffs between process managers, each responsible for one phase of the workflow:

| Time | Event | System Action | ECO Status |
| --- | --- | --- | --- |
| T+0 | NOC creates ECO | ECOCreatedEvent emitted | OPEN |
| T+1m | Render integration processor | Creates task in Render, status → TASKED | OPEN |
| T+1m | Pager processor triggers | Starts pager run, calls Twilio | OPEN |
| T+5m | Supervisor accepts page | Webhook: ACCEPTED, status → DISPATCHED | OPEN |
| T+10m | Supervisor assigns tech | Render task → released | IN_PROGRESS |
| T+30m | Tech arrives, investigates | Task updates detected by polling | IN_PROGRESS |
| T+35m | Tech clones follow-on task | Late clone detected, new task tracked | IN_PROGRESS |
| T+2h | Investigation completed | Task → completed | IN_PROGRESS |
| T+4h | Follow-on completed | All tasks done → grace period starts | COMPLETED |
| T+28h | Grace period expires | Polling stops, ECO finalized | HARD_COMPLETE |

Each row in this timeline is backed by an event in the store. The process managers don’t communicate directly — they react to events and emit commands. The Render integration processor doesn’t know about the pager processor. The pager processor doesn’t know about the polling service. They’re connected by the event bus, and the event store is the shared record of what happened.

This is the design that lets a single button press — “Create ECO” — cascade into a multi-hour, multi-system, multi-human workflow, while every participant (human and machine) sees exactly the information they need, exactly when they need it.

  1. An ECO starts as a CreateECOCommand carrying geometry, job type, description, and dual tenant context. The aggregate validates and emits a CreatedEvent.
  2. The Render integration processor creates a task in the OSP's Render tenant, using subsector naming to group all work for the ECO. Success moves the integration status to TASKED.
  3. The pager process manager resolves contacts by region (PostGIS), calls via Twilio, and tracks state in a PostgreSQL table — not in memory. A background worker handles timeouts.
  4. The polling service watches Render every 60 seconds, using fingerprint-based change detection to minimize noise. It handles late clones and completion grace periods.
  5. Three status layers serve different audiences: event-sourced state for correctness, Render task status for integration tracking, and display status for NOC operators.
  6. The mock server simulates both Render and Twilio with 7 preset scenarios, goroutine-per-ECO orchestration, and an embedded dashboard — enabling full lifecycle testing without external dependencies.
  7. The event stream is the audit trail. Every state change, every pager attempt, every field update is an immutable fact in the event store — regulatory compliance as an architectural consequence.
  • The grace period defaults to 24 hours. What operational patterns might justify making this shorter or longer? What signals could the system use to set it dynamically?
  • The pager process manager uses PostgreSQL for state instead of a saga framework. At what scale or complexity would this trade-off start to hurt, and what would migration to a framework look like?
  • The three-layer status model adds mapping complexity. Could a simpler two-layer model (internal + display) work, or does the Render integration layer earn its existence?
  1. When the Fiber Goes Dark — The domain context that explains why ECOs exist and what OSP emergency response looks like.
  2. The Shape of the System — Mono-repo structure, package boundaries, and the request flow that this cairn builds on.
  3. Events All the Way Down — The event sourcing architecture that makes the ECO lifecycle possible, including the Watermill migration.
  4. Render Networks — Field workforce management platform. External service that OSP contractors use for task management and crew dispatch.
  5. Twilio — Communications API used for pager dispatch. Strike uses Twilio Studio flows for the call sequence and webhooks for status callbacks.