Events All the Way Down
Event sourcing and CQRS in Osprey Strike — the architecture, the migration, and the escape hatch · ~18 min read · Suggested by Bob
Most systems store what things are. Strike stores what happened — every state change as an immutable fact in an append-only log. This buys you a perfect audit trail, temporal queries, and event replay. It costs you cognitive load, hiring difficulty, and an architectural complexity budget that has to be justified every quarter. This cairn explains both sides of that ledger.
State as a Sequence of Facts
Traditional applications store the current state of an entity. You update a row, the old values are gone. If you want to know what changed, you add an updated_at column and hope for the best. Osprey Strike does something fundamentally different: it stores every change that has ever happened to an entity as an immutable event in an append-only log.
The current state isn’t stored — it’s derived by replaying the event stream from the beginning.
graph LR
subgraph Traditional CRUD
R[ECO Record<br/>status: COMPLETED<br/>updated: now]
end
subgraph Event Sourced
E1[ECOCreated<br/>Jan 1 10:00] --> E2[RenderTasked<br/>Jan 1 10:02]
E2 --> E3[PagerDispatched<br/>Jan 1 10:15]
E3 --> E4[StatusCompleted<br/>Jan 2 14:00]
end
E4 -.->|replay derives| CS[Current State]
This is event sourcing. Instead of asking “what is the ECO?” you can ask “what happened to the ECO?” — and you get the full answer, with timestamps, actor IDs, and the exact sequence of state transitions.
Event Sourcing: A persistence pattern where state changes are stored as a sequence of immutable events rather than as mutable row updates. The current state of an entity is computed by replaying all its events from the beginning (or from a snapshot). The event log is append-only — events are never modified or deleted.
Why does this matter for emergency callout management? Because fiber construction and emergency response operate in a regulated space where the question isn’t just “what’s the status?” but “who changed the status, when, and what was it before?” An ECO that went from OPEN to COMPLETED should have a traceable chain: who dispatched the crew, when the contractor accepted, what the timeline looked like. With event sourcing, that audit trail isn’t a bolt-on — it’s the data model itself. The practical value of this goes beyond compliance. When a pager run fails at 3 AM and the on-call person needs to understand what happened, the event stream is the definitive record. No logs to correlate, no timestamps to cross-reference — just the sequence of facts.
The trade-off is real. Event sourcing adds conceptual overhead: developers need to think in terms of events rather than state mutations, aggregates need to be rehydrated from event streams, and eventual consistency becomes a design constraint rather than an edge case. The rest of this cairn maps both sides of that ledger.
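The core mechanic — deriving state by folding over events — fits in a few lines of Go. This is an illustrative sketch, not Strike's actual types: the event names are simplified stand-ins for the real `ECOCreatedEvent` and `ECOStatusUpdatedEvent` structs shown later.

```go
package main

import "fmt"

// Hypothetical event types, loosely modeled on the ECO events
// described later in this cairn.
type Event interface{ isEvent() }

type ECOCreated struct{ JobNumber int }
type StatusCompleted struct{}

func (ECOCreated) isEvent()      {}
func (StatusCompleted) isEvent() {}

type ECOState struct {
	JobNumber int
	Status    string
}

// Replay folds the event stream, in order, into current state.
// The store never holds ECOState directly — only the events.
func Replay(events []Event) ECOState {
	var s ECOState
	for _, e := range events {
		switch ev := e.(type) {
		case ECOCreated:
			s.JobNumber = ev.JobNumber
			s.Status = "OPEN"
		case StatusCompleted:
			s.Status = "COMPLETED"
		}
	}
	return s
}

func main() {
	s := Replay([]Event{ECOCreated{JobNumber: 42}, StatusCompleted{}})
	fmt.Printf("job %d is %s\n", s.JobNumber, s.Status)
}
```

The key property: `Replay` is a pure function of the log. Given the same events in the same order, it always produces the same state — which is what makes audit, replay, and temporal queries free.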
The Migration: Kotlin/Axon to Go/Watermill
Strike wasn’t always built this way. The original implementation used Kotlin on Spring Boot with Axon Framework — a mature Java ecosystem for event sourcing and CQRS. Axon provides distributed event processing, tracking processors, saga persistence, and multi-node coordination out of the box. It’s powerful infrastructure.
It was also dramatically over-provisioned for Strike’s actual workload.
An analysis of the Kotlin codebase found that roughly 70% of Axon Framework’s capabilities went unused. Distributed processors, multi-node coordination, tracking token management, saga serialization — all present in the runtime, none required by the business logic. The actual domain was straightforward API orchestration: create an ECO, push to Render, call a pager list. The framework was solving problems the system didn’t have. ADR-001 documents this analysis in detail. The table of Axon features vs. actual usage is illuminating — event sourcing and aggregate rehydration were used, but tracking processors, distributed sagas, snapshots, multi-node coordination, and event upcasting were all either unused or unnecessary at Strike’s scale.
The migration to Go with Watermill and PostgreSQL was driven by four concerns:
| Concern | Kotlin/Axon | Go/Watermill |
|---|---|---|
| Operational overhead | Axon Server as separate service, JVM memory (512MB–1GB per pod), 15–60s startup | Single binary, 50–100MB memory, 1–2s startup |
| Cognitive load | Spring auto-configuration, Axon annotation magic (@Saga, @EventSourcingHandler), implicit state persistence | Explicit wiring, plain functions, queryable state tables |
| Team handoff risk | Kotlin + Arrow FP + Axon = niche hiring pool | Go + PostgreSQL = mainstream skills |
| Unused capacity | ~70% of Axon features unused | Architecture right-sized for actual workload |
The key architectural decision: keep event sourcing and CQRS, replace the framework. Watermill provides lightweight message routing without the framework magic. PostgreSQL replaces Axon Server as the event store — which means events, projections, and process manager state all live in one database. No separate event store infrastructure to operate, monitor, or back up.
The migration preserved the architectural patterns (event sourcing, CQRS, transactional outbox, process managers) while eliminating the framework complexity. The domain logic was a rewrite; the architecture was a transplant.
This was a deliberate bet. Watermill’s pluggable architecture means the event store backend can be swapped to KurrentDB or Kafka later if PostgreSQL proves insufficient. But at Strike’s current scale — low event volume, single-instance deployment — PostgreSQL polling at 100ms intervals is more than adequate, and it removes an entire infrastructure dependency.
The CQRS Write Path
When a NOC operator creates an Emergency Callout, the request follows a pipeline that separates the act of writing from the act of reading. This is CQRS — Command Query Responsibility Segregation — and understanding the write path is essential to understanding how Strike works.
flowchart LR
GQL[GraphQL Mutation] --> CB[Command Bus]
CB --> CH[Command Handler]
CH --> AGG[Aggregate]
AGG --> ES[Event Store]
ES --> OB[Outbox]
subgraph Single Transaction
ES
OB
end
The flow, step by step:
- GraphQL mutation — The resolver receives the request and constructs a command: `CreateECOCommand` with the job number, geometry, description, and tenant context.
- Command bus — Watermill routes the command to the appropriate handler by topic name (`commands.CreateECOCommand`).
- Command handler — Loads the aggregate from the event store (replaying all previous events to reconstruct current state), passes the command to the aggregate’s `Handle()` method.
- Aggregate — Validates business rules. If valid, returns one or more events. If invalid, returns an error. The aggregate never mutates state during command handling — it only produces events.
- Event store + outbox — The events are appended to the `events` table and copied to the `event_outbox` table in the same database transaction. Both succeed or both fail.
Here’s what the command handler looks like in practice:
func (h *Handler) HandleCreateECO(ctx context.Context, cmd *CreateECOCommand) error {
agg := h.loadAggregate(ctx, cmd.ID)
events, err := agg.Handle(cmd)
if err != nil {
return err // Business rule violation
}
return h.store.Append(ctx, cmd.ID, agg.Version(), events)
}
The GraphQL resolver gets a response as soon as the events are durably stored. It doesn’t wait for projections to update, subscribers to be notified, or the pager to start ringing. The write path’s job is to validate and persist facts. Everything else happens asynchronously.
The CQRS Read Path
The read path is a different world entirely. Queries never touch the event store. They read from projection tables — denormalized SQL tables optimized for exactly the queries the UI needs.
flowchart LR
OBP[Outbox Publisher] --> EB[Event Bus]
EB --> PH[Projection Handlers]
PH --> EV[eco_views]
PH --> PV[pager_run_views]
GQL[GraphQL Query] --> REPO[Repository]
REPO --> EV
REPO --> PV
The outbox publisher polls for unpublished events every 100ms. When it finds them, it publishes to Watermill’s in-memory event bus. Projection handlers subscribe to specific event types and update the view tables:
- ECO projection — Listens for `ECOCreatedEvent`, `ECOUpdatedEvent`, `ECORenderIntegrationStatusUpdatedEvent`, and `ECOStatusUpdatedEvent`. Maintains the `eco_views` table with all the fields the UI needs: job number, status, render integration status, timestamps, tenant IDs.
- PagerRun projection — Listens for `PagerRunStartedEvent`, `PagerEventRecordedEvent`, and `PagerRunCompletedEvent`. Maintains the `pager_run_views` table with the full paging timeline, outcome, and error details.
When a user queries their ECO list, the GraphQL resolver calls the repository, which runs a SELECT against eco_views. No event replay, no aggregate loading — just a SQL query with appropriate indexes and tenant filters.
The separation of reads and writes means you can optimize each independently. The write model is normalized around aggregates and business invariants. The read model is denormalized around query patterns. If the UI needs a new view — say, “show me all ECOs where paging timed out” — you add a projection handler and a new view table. The write path doesn’t change at all.
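A projection handler is mechanically simple: consume an event, upsert a row. The sketch below uses an in-memory map as a stand-in for the `eco_views` table, and the field set is illustrative — the real handler runs an SQL upsert against PostgreSQL:

```go
package main

import "fmt"

// Simplified event shape; the real ECOCreatedEvent carries more fields.
type ECOCreatedEvent struct {
	ID        string
	JobNumber int
}

type ECOView struct {
	ID        string
	JobNumber int
	Status    string
}

type ECOProjection struct {
	views map[string]ECOView // in-memory stand-in for the eco_views table
}

// Handle upserts the view row. Because the write keys on the aggregate ID,
// a duplicate delivery rewrites the same row — the handler is idempotent,
// which is what makes at-least-once delivery safe.
func (p *ECOProjection) Handle(evt ECOCreatedEvent) {
	p.views[evt.ID] = ECOView{ID: evt.ID, JobNumber: evt.JobNumber, Status: "OPEN"}
}

func main() {
	p := &ECOProjection{views: map[string]ECOView{}}
	evt := ECOCreatedEvent{ID: "eco-1", JobNumber: 7}
	p.Handle(evt)
	p.Handle(evt) // at-least-once delivery: the duplicate changes nothing
	fmt.Println(len(p.views))
}
```

Adding a new view ("all ECOs where paging timed out") means writing another handler like this one and a matching table — the write path never knows it exists.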
The gap between writing an event and seeing it in a query is typically under 100ms. In practice, it’s imperceptible. But the decoupling is architecturally significant: if a projection handler crashes, events aren’t lost — they’re still in the outbox, waiting to be published. You fix the handler, redeploy, and the projection catches up. This is eventual consistency in its most pragmatic form. The system never lies — it might briefly be behind, but it will always converge. And the event log is the ground truth that makes convergence provable.
Aggregates: ECO and PagerRun
Aggregates are the enforcement boundary for business rules. In Strike, there are two aggregate roots: ECO (Emergency Callout Order) and PagerRun (a single pager dispatch attempt). Each aggregate is loaded by replaying its events, validates commands against its current state, and emits new events if the command is valid.
Here’s the aggregate interface pattern:
type Aggregate struct {
state *State // Current computed state
version int // Number of events applied
}
// Handle validates a command and returns events
func (agg *Aggregate) Handle(cmd any) ([]any, error) { ... }
// Apply updates state from an event (called during replay)
func (agg *Aggregate) Apply(event any) error { ... }
The ECO aggregate enforces invariants like: you can’t update a completed ECO, job types must be CMR/MFR/FNOL, render status transitions must follow a valid state machine (PENDING → TASKED → DISPATCHED, with FAILED reachable from any state).
stateDiagram-v2
[*] --> Created: ECOCreatedEvent
Created --> RenderTasked: RenderStatusUpdated(TASKED)
RenderTasked --> RenderDispatched: RenderStatusUpdated(DISPATCHED)
Created --> RenderFailed: RenderStatusUpdated(FAILED)
RenderTasked --> RenderFailed: RenderStatusUpdated(FAILED)
Created --> Updated: ECOUpdatedEvent
Updated --> Updated: ECOUpdatedEvent
Updated --> StatusCompleted: StatusUpdated(COMPLETED)
Created --> StatusCompleted: StatusUpdated(COMPLETED)
RenderDispatched --> StatusCompleted: StatusUpdated(COMPLETED)
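The "can't update a completed ECO" invariant, and the rule that `Handle` produces events without mutating state, can be sketched like this. The command and event shapes are simplified stand-ins, not Strike's actual definitions:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative command/event shapes for one invariant.
type UpdateECOCommand struct{ Description string }
type ECOUpdatedEvent struct{ Description string }

var ErrECOCompleted = errors.New("cannot update a completed ECO")

type ECOAggregate struct {
	status  string
	version int
}

// Handle validates against current state and returns events on success.
// Note: it never touches a.status — state changes only via Apply.
func (a *ECOAggregate) Handle(cmd UpdateECOCommand) ([]any, error) {
	if a.status == "COMPLETED" {
		return nil, ErrECOCompleted
	}
	return []any{ECOUpdatedEvent{Description: cmd.Description}}, nil
}

// Apply folds an event into state; it runs during replay.
func (a *ECOAggregate) Apply(event any) {
	if _, ok := event.(ECOUpdatedEvent); ok {
		a.version++
	}
}

func main() {
	agg := &ECOAggregate{status: "COMPLETED"}
	if _, err := agg.Handle(UpdateECOCommand{Description: "late edit"}); err != nil {
		fmt.Println(err)
	}
}
```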
Optimistic concurrency prevents conflicting writes. Every event in the store has a version number within its aggregate, and a UNIQUE(aggregate_id, version) constraint in PostgreSQL. If two requests load the same aggregate at version 2 and both try to write version 3, one succeeds and the other gets a version conflict error. The failing request must reload and retry.
// From outbox/store.go — the version check
currentVersion, err := store.getCurrentVersion(ctx, tx, aggregateID)
if err != nil {
    return err
}
if currentVersion != expectedVersion {
    return eventstore.ErrVersionConflict
}
The aggregate is the consistency boundary. Within an aggregate, state transitions are atomic and validated. Between aggregates, coordination happens through events and process managers — never through shared mutable state.
Events themselves are simple Go structs with JSON serialization. Every event carries an ActorID for audit trail purposes and a Timestamp for temporal ordering:
type CreatedEvent struct {
ID string `json:"id"`
ActorID string `json:"actor_id"`
NocTenantID string `json:"noc_tenant_id"`
OspTenantID string `json:"osp_tenant_id"`
JobNumber int `json:"job_number"`
Geometry Geometry `json:"geometry"`
JobType string `json:"job_type"`
Description string `json:"description"`
Attachments []string `json:"attachments"`
Timestamp int64 `json:"timestamp"`
}
The Transactional Outbox
Here’s a problem that looks simple until you think about failure modes. You need to save an event and publish it to the message bus. Two operations. What if the first succeeds and the second fails?
flowchart TD
CMD[Command Handler] --> TX{Begin Transaction}
TX --> INS1[INSERT into events]
TX --> INS2[INSERT into event_outbox]
TX --> COMMIT[COMMIT]
POLL[Background Publisher<br/>polls every 100ms] --> FETCH[SELECT FROM event_outbox<br/>WHERE published_at IS NULL]
FETCH --> PUB[Publish to Watermill]
PUB --> MARK[SET published_at = NOW]
PUB -->|failure| INC[Increment attempts]
INC -->|max attempts| DLQ[Move to dead letter queue]
If you save the event to the database and then publish to the message bus, and the publish fails, you have an event that exists but was never delivered. Subscribers miss it. The projection never updates. The pager never fires.
The transactional outbox pattern solves this by writing both the event and a copy of it in an outbox table within the same database transaction. Both writes succeed or both fail — PostgreSQL’s ACID guarantees handle this. A background publisher then polls the outbox for unpublished entries and delivers them to the Watermill event bus.
// From outbox/store.go — the atomic write
func (store *EventStore) Append(ctx context.Context,
    aggregateID uuid.UUID, expectedVersion int,
    events []eventstore.Event) error {
    tx, err := store.db.Begin(ctx)
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx)
    // Version check (optimistic concurrency)
    currentVersion, err := store.getCurrentVersion(ctx, tx, aggregateID)
    if err != nil {
        return err
    }
    if currentVersion != expectedVersion {
        return eventstore.ErrVersionConflict
    }
    for i, evt := range events {
        // Each appended event takes the next version number in the aggregate
        eventID, err := store.insertEvent(ctx, tx, aggregateID, expectedVersion+i+1, evt)
        if err != nil {
            return err
        }
        if err := store.insertOutboxEntry(ctx, tx, eventID, aggregateID, evt); err != nil {
            return err
        }
    }
    return tx.Commit(ctx)
}
The delivery guarantee is at-least-once. If the publisher crashes after publishing but before marking the entry as published, the entry will be published again on restart. This means handlers must be idempotent — they need to tolerate seeing the same event twice without producing incorrect results.
The outbox table also has a dead letter queue (event_outbox_dlq). After a configurable number of failed publish attempts, entries are moved to the DLQ for manual review rather than blocking subsequent events. This prevents a single poison message from stalling the entire pipeline.
At-least-once delivery means every event handler must be idempotent. The pager process manager checks for existing state before starting a new run. The ECO projection uses upserts. If you add a new handler, design it to handle duplicate delivery from day one.
Process Managers
Some workflows span multiple aggregates and involve external systems with unpredictable timing. The pager dispatch flow is the canonical example: send a page via Twilio, wait for a webhook callback, handle timeouts, retry the next contact. This is too complex for a single aggregate — it requires a process manager.
Strike’s process managers use PostgreSQL state tables and background workers instead of distributed saga frameworks. The trade-off is deliberate: you lose built-in compensation orchestration, but you gain debuggability. When paging fails at 3 AM, SELECT * FROM pager_process_state WHERE eco_id = '...' tells you exactly where the workflow stopped and why.
The pager process manager’s lifecycle:
stateDiagram-v2
[*] --> PENDING: ECO becomes TASKED
PENDING --> RUNNING: StartPagerRunCommand sent
RUNNING --> RUNNING: Activity extends deadline
RUNNING --> COMPLETED: Contractor ACCEPTED
RUNNING --> COMPLETED: All contacts EXHAUSTED
RUNNING --> COMPLETED: Deadline exceeded (TIMEOUT)
The process manager listens for `RenderIntegrationStatusUpdatedEvent`. When an ECO reaches TASKED status, it:
- Resolves the geographic region from the ECO’s geometry
- Loads the contact list for that region (from the OSP tenant’s pager lists)
- Creates a `pager_process_state` record with status PENDING and a deadline
- Sends a `StartPagerRunCommand` to the PagerRun aggregate
- Calls the Twilio API to initiate the call flow
type State struct {
PagerRunID string
ECOID string
Status ProcessStatus // PENDING → RUNNING → COMPLETED
StartedAt time.Time
DeadlineAt time.Time // Background worker checks this
LastEventAt time.Time
Version int // Optimistic locking
}
A background worker polls every 5 seconds for overdue processes (WHERE status = 'RUNNING' AND deadline_at < NOW()). When a deadline is exceeded, the worker marks the run as failed. Each webhook callback from Twilio extends the deadline — so active calls keep the process alive.
Process manager state uses optimistic locking with a version column, just like aggregates. If a webhook callback and the deadline worker both try to update the same process simultaneously, one gets a version conflict and must retry. This prevents lost updates without database-level row locks.
Idempotency is baked in at every entry point. The handler for RenderIntegrationStatusUpdatedEvent checks for an existing pager run before starting a new one — so if Watermill delivers the event twice, the second delivery is a no-op:
existingState, err := proc.stateRepo.FindByECOID(ctx, evt.ID)
if err == nil && existingState != nil {
return nil // Already processing — skip duplicate
}
Snapshots and Upcasters
Two mechanical concerns become important as an event-sourced system ages: performance degradation from long event streams, and schema evolution when event structures change.
Snapshots solve the performance problem. Without them, loading an aggregate means replaying every event from the beginning. For a long-lived ECO with dozens of status transitions, that’s a lot of events. Strike uses an EventCountPolicy with a threshold of 50: after every 50 new events, the aggregate’s current state is serialized and saved as a snapshot. On next load, the system loads the snapshot and replays only events after it.
type EventCountPolicy struct {
Threshold int // Default: 50
}
func (p *EventCountPolicy) ShouldSnapshot(currentVersion, snapshotVersion int) bool {
return (currentVersion - snapshotVersion) >= p.Threshold
}
The snapshot store uses a simple snapshots table with UPSERT semantics — each aggregate has at most one snapshot, and it’s replaced when a new one is created. The Snapshotable interface requires two methods: CreateSnapshot() to serialize state and ApplySnapshot() to restore it.
At Strike’s current scale, snapshots are mostly insurance. ECO aggregates typically accumulate fewer than 20 events. But the infrastructure is wired and tested, so when aggregates do get long (process managers with many retries, for instance), the performance cliff is already handled.
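The load-with-snapshot path looks roughly like this. State is an `int` for brevity — the real code restores typed state via `ApplySnapshot` and replays typed events:

```go
package main

import "fmt"

// Minimal snapshot and event shapes for the sketch.
type Snapshot struct {
	Version int
	State   int // stands in for serialized aggregate state
}

type Event struct {
	Version int
	Delta   int
}

// load restores the snapshot (if any), then replays only the events
// recorded after the snapshot's version — the optimization snapshots
// exist to provide.
func load(snap *Snapshot, events []Event) (state, version int) {
	if snap != nil {
		state, version = snap.State, snap.Version
	}
	for _, e := range events {
		if e.Version <= version {
			continue // already folded into the snapshot
		}
		state += e.Delta
		version = e.Version
	}
	return state, version
}

func main() {
	snap := &Snapshot{Version: 50, State: 100}
	events := []Event{{Version: 50, Delta: 1}, {Version: 51, Delta: 2}, {Version: 52, Delta: 3}}
	state, version := load(snap, events)
	fmt.Println(state, version) // only versions 51 and 52 are replayed
}
```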
Upcasters solve schema evolution. When the structure of an event changes — say, CreatedEvent gains a new required field — old events in the store still have the old schema. Rather than migrating historical data, Strike uses an UpcasterRegistry that transforms event payloads from old schema versions to the current version during event loading.
type Upcaster interface {
FromVersion() int
ToVersion() int
EventType() string
Upcast(payload []byte) ([]byte, error)
}
Upcasters chain automatically. If you have upcasters for v1→v2 and v2→v3, loading a v1 event runs both transformations in sequence. Every event in the store carries an `event_schema_version` field, and the registry uses it to determine which transformations to apply.
Events are immutable, but their interpretation evolves. Upcasters are the mechanism that lets the code move forward without rewriting history. The event store remains the source of truth; upcasters are a read-time lens that translates old facts into the current language.
The Phase 2 Escape Hatch
Here’s the part most architecture documents leave out: the documented plan for when this might be too much.
The architecture simplification roadmap defines a Phase 2 option: simplify from CQRS/Event Sourcing to straightforward CRUD. Not because CQRS is wrong, but because the team acknowledged — in writing — that the complexity cost might exceed the benefit for a pre-product-market-fit application with a small team.
The decision framework is explicit:
| Trigger | Action |
|---|---|
| New developers take >3 weeks to onboard | Revisit Phase 2 |
| CQRS-related bugs exceed 20% of total issues | Revisit Phase 2 |
| Product direction shifts significantly | Revisit Phase 2 |
| Team is comfortable, bugs are normal | Keep CQRS |
| Event replay or time travel gets used | Keep CQRS |
What you’d lose: the complete event stream (replaced by an audit_log table with PostgreSQL triggers), temporal queries, event replay for rebuilding read models, and the ability to add new projections without touching the write path.
What you’d gain: 54% code reduction (from ~11,900 to ~5,500 lines), strong consistency instead of eventual consistency, halved onboarding time (1–2 weeks instead of 2–3), and a dramatically simpler mental model — standard Go services with SQL queries. The Phase 2 design is fully specified, including database schema, service layer code, and migration steps. It’s not a hand-wave — it’s a concrete plan that can be executed in weeks, not months. The architecture simplification roadmap includes working code samples for every major component.
The current recommendation, after two rounds of architecture review: defer Phase 2. The Go/Watermill port achieved the primary operational goals — no Axon Server, 10x memory reduction, mainstream language. The CQRS patterns are well-documented, the test coverage is strong (86%), and the cognitive load concern is mitigated by the architecture guide and cairns like this one.
The escape hatch is not a criticism of the current architecture. It’s an acknowledgment that architectural decisions exist in context — team size, product maturity, hiring market — and that context changes. Documenting the off-ramp is a sign of engineering maturity, not architectural doubt.
The optionality is preserved because of a Watermill design decision: the message bus backend is pluggable. If the team decides to simplify to CRUD, the migration path is incremental — run both systems in parallel, migrate service by service, then remove the CQRS infrastructure. If the team decides to scale up, the same pluggability lets you swap PostgreSQL polling for Kafka or NATS without rewriting application code.
Summary
- Strike uses event sourcing — state changes are stored as immutable events in an append-only PostgreSQL table. Current state is derived by replaying events through aggregates. This provides a complete audit trail, temporal queries, and event replay at the cost of increased architectural complexity.
- The system migrated from Kotlin/Axon to Go/Watermill because ~70% of Axon's capabilities went unused, JVM overhead was significant, and the Kotlin/Axon hiring pool was too narrow for a small team handoff. The migration preserved the patterns while replacing the framework.
- CQRS separates writes (commands → aggregates → events → outbox) from reads (projections → view tables → queries). The transactional outbox guarantees at-least-once delivery by writing events and outbox entries in the same database transaction, with a background publisher polling every 100ms.
- ECO and PagerRun are aggregate roots that enforce business invariants and use optimistic concurrency (version numbers + unique constraints) to prevent conflicting writes. Aggregates produce events but never mutate state during command handling.
- Process managers coordinate multi-step workflows using PostgreSQL state tables and background workers instead of distributed saga frameworks. The trade-off: manual idempotency and compensation logic in exchange for direct SQL debuggability.
- Snapshots (50-event threshold) optimize aggregate loading for long-lived entities. Upcasters handle event schema evolution by transforming old payloads to the current format at read time, without modifying historical data.
- A documented Phase 2 escape hatch exists to simplify to CRUD if CQRS proves too costly. The decision framework specifies concrete triggers for revisiting. Current recommendation: defer — the Go port achieved operational goals and the patterns are well-documented.
Discussion Prompts
- The transactional outbox polls at 100ms intervals. What failure scenarios does this protect against that an in-process event publish wouldn't? When would you accept the simpler approach?
- The Phase 2 escape hatch replaces the event stream with a PostgreSQL audit trigger. What information would you lose in that translation, and would it matter for this domain?
- Process managers use PostgreSQL state tables instead of a saga framework. At what team size or system complexity would you reconsider that trade-off?
References
- Martin Fowler: Event Sourcing — Canonical reference for the event sourcing pattern and its trade-offs.
- Martin Fowler: CQRS — Command Query Responsibility Segregation explained in its original context.
- Watermill — Go library for event-driven applications, providing the command bus, event bus, and CQRS infrastructure used in Strike.
- Transactional Outbox Pattern — Chris Richardson's pattern description for reliable event publishing without distributed transactions.
- Event Store: Event Sourcing Explained — Practical introduction to event sourcing from the team behind EventStoreDB.
- ADR-001: Backend Technology Stack Migration — Internal decision record documenting the Kotlin/Axon → Go/Watermill rationale and alternatives analysis.
- Architecture Simplification Roadmap — Internal document specifying Phase 1 (Go port) and Phase 2 (CRUD simplification) with decision framework and migration steps.
Generated by Cairns · Agent-powered with Claude