Events All the Way Down
Event sourcing and CQRS in Osprey Strike — the architecture, the migration, and the escape hatch · ~18 min read · Suggested by Bob
Most systems store what things are. Strike stores what happened — every state change as an immutable fact in an append-only log. This buys you a perfect audit trail, temporal queries, and event replay. It costs you cognitive load, hiring difficulty, and an architectural complexity budget that has to be justified every quarter. This cairn explains both sides of that ledger.
State as a Sequence of Facts
Traditional applications store the current state of an entity. You update a row, the old values are gone. If you want to know what changed, you add an updated_at column and hope for the best. Osprey Strike does something fundamentally different: it stores every change that has ever happened to an entity as an immutable event in an append-only log.
The current state isn’t stored — it’s derived by replaying the event stream from the beginning.
graph LR
subgraph Traditional CRUD
R[ECO Record<br/>status: COMPLETED<br/>updated: now]
end
subgraph Event Sourced
E1[ECOCreated<br/>Jan 1 10:00] --> E2[RenderTasked<br/>Jan 1 10:02]
E2 --> E3[PagerDispatched<br/>Jan 1 10:15]
E3 --> E4[StatusCompleted<br/>Jan 2 14:00]
end
E4 -.->|replay derives| CS[Current State]
This is event sourcing. Instead of asking “what is the ECO?” you can ask “what happened to the ECO?” — and you get the full answer, with timestamps, actor IDs, and the exact sequence of state transitions.
Event Sourcing: A persistence pattern where state changes are stored as a sequence of immutable events rather than as mutable row updates. The current state of an entity is computed by replaying all its events from the beginning (or from a snapshot). The event log is append-only — events are never modified or deleted.
Why does this matter for emergency callout management? Because fiber construction and emergency response operate in a regulated space where the question isn’t just “what’s the status?” but “who changed the status, when, and what was it before?” An ECO that went from OPEN to COMPLETED should have a traceable chain: who dispatched the crew, when the contractor accepted, what the timeline looked like. With event sourcing, that audit trail isn’t a bolt-on — it’s the data model itself. The practical value of this goes beyond compliance. When a pager run fails at 3 AM and the on-call person needs to understand what happened, the event stream is the definitive record. No logs to correlate, no timestamps to cross-reference — just the sequence of facts.
The trade-off is real. Event sourcing adds conceptual overhead: developers need to think in terms of events rather than state mutations, aggregates need to be rehydrated from event streams, and eventual consistency becomes a design constraint rather than an edge case. The rest of this cairn maps both sides of that ledger.
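The core mechanic — deriving state by folding over events — fits in a few lines of Go. This is an illustrative sketch, not Strike's actual types: the event names are simplified stand-ins for the real `ECOCreatedEvent` and `ECOStatusUpdatedEvent` structs shown later.

```go
package main

import "fmt"

// Hypothetical event types, loosely modeled on the ECO events
// described later in this cairn.
type Event interface{ isEvent() }

type ECOCreated struct{ JobNumber int }
type StatusCompleted struct{}

func (ECOCreated) isEvent()      {}
func (StatusCompleted) isEvent() {}

type ECOState struct {
	JobNumber int
	Status    string
}

// Replay folds the event stream, in order, into current state.
// The store never holds ECOState directly — only the events.
func Replay(events []Event) ECOState {
	var s ECOState
	for _, e := range events {
		switch ev := e.(type) {
		case ECOCreated:
			s.JobNumber = ev.JobNumber
			s.Status = "OPEN"
		case StatusCompleted:
			s.Status = "COMPLETED"
		}
	}
	return s
}

func main() {
	s := Replay([]Event{ECOCreated{JobNumber: 42}, StatusCompleted{}})
	fmt.Printf("job %d is %s\n", s.JobNumber, s.Status)
}
```

The key property: `Replay` is a pure function of the log. Given the same events in the same order, it always produces the same state — which is what makes audit, replay, and temporal queries free.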
The Migration: Kotlin/Axon to Go/Watermill
Strike wasn’t always built this way. The original implementation used Kotlin on Spring Boot with Axon Framework — a mature Java ecosystem for event sourcing and CQRS. Axon provides distributed event processing, tracking processors, saga persistence, and multi-node coordination out of the box. It’s powerful infrastructure.
It was also dramatically over-provisioned for Strike’s actual workload.
An analysis of the Kotlin codebase found that roughly 70% of Axon Framework’s capabilities went unused. Distributed processors, multi-node coordination, tracking token management, saga serialization — all present in the runtime, none required by the business logic. The actual domain was straightforward API orchestration: create an ECO, push to Render, call a pager list. The framework was solving problems the system didn’t have. ADR-001 documents this analysis in detail. The table of Axon features vs. actual usage is illuminating — event sourcing and aggregate rehydration were used, but tracking processors, distributed sagas, snapshots, multi-node coordination, and event upcasting were all either unused or unnecessary at Strike’s scale.
The migration to Go with Watermill and PostgreSQL was driven by four concerns:
| Concern | Kotlin/Axon | Go/Watermill |
|---|---|---|
| Operational overhead | Axon Server as separate service, JVM memory (512MB–1GB per pod), 15–60s startup | Single binary, 50–100MB memory, 1–2s startup |
| Cognitive load | Spring auto-configuration, Axon annotation magic (@Saga, @EventSourcingHandler), implicit state persistence | Explicit wiring, plain functions, queryable state tables |
| Team handoff risk | Kotlin + Arrow FP + Axon = niche hiring pool | Go + PostgreSQL = mainstream skills |
| Unused capacity | ~70% of Axon features unused | Architecture right-sized for actual workload |
The key architectural decision: keep event sourcing and CQRS, replace the framework. Watermill provides lightweight message routing without the framework magic. PostgreSQL replaces Axon Server as the event store — which means events, projections, and process manager state all live in one database. No separate event store infrastructure to operate, monitor, or back up.
The migration preserved the architectural patterns (event sourcing, CQRS, transactional outbox, process managers) while eliminating the framework complexity. The domain logic was a rewrite; the architecture was a transplant.
This was a deliberate bet. Watermill’s pluggable architecture means the event store backend can be swapped to KurrentDB or Kafka later if PostgreSQL proves insufficient. But at Strike’s current scale — low event volume, single-instance deployment — PostgreSQL polling at 100ms intervals is more than adequate, and it removes an entire infrastructure dependency.
The CQRS Write Path
When a NOC operator creates an Emergency Callout, the request follows a pipeline that separates the act of writing from the act of reading. This is CQRS — Command Query Responsibility Segregation — and understanding the write path is essential to understanding how Strike works.
flowchart LR
GQL[GraphQL Mutation] --> CB[Command Bus]
CB --> CH[Command Handler]
CH --> AGG[Aggregate]
AGG --> ES[Event Store]
ES --> OB[Outbox]
subgraph Single Transaction
ES
OB
end
The flow, step by step:
- GraphQL mutation — The resolver receives the request and constructs a command: `CreateECOCommand` with the job number, geometry, description, and tenant context.
- Command bus — Watermill routes the command to the appropriate handler by topic name (`commands.CreateECOCommand`).
- Command handler — Loads the aggregate from the event store (replaying all previous events to reconstruct current state), passes the command to the aggregate’s `Handle()` method.
- Aggregate — Validates business rules. If valid, returns one or more events. If invalid, returns an error. The aggregate never mutates state during command handling — it only produces events.
- Event store + outbox — The events are appended to the `events` table and copied to the `event_outbox` table in the same database transaction. Both succeed or both fail.
Here’s what the command handler looks like in practice:
func (h *Handler) HandleCreateECO(ctx context.Context, cmd *CreateECOCommand) error {
agg := h.loadAggregate(ctx, cmd.ID)
events, err := agg.Handle(cmd)
if err != nil {
return err // Business rule violation
}
return h.store.Append(ctx, cmd.ID, agg.Version(), events)
}
The GraphQL resolver gets a response as soon as the events are durably stored. It doesn’t wait for projections to update, subscribers to be notified, or the pager to start ringing. The write path’s job is to validate and persist facts. Everything else happens asynchronously.
The CQRS Read Path
The read path is a different world entirely. Queries never touch the event store. They read from projection tables — denormalized SQL tables optimized for exactly the queries the UI needs.
flowchart LR
OBP[Outbox Publisher] --> EB[Event Bus]
EB --> PH[Projection Handlers]
PH --> EV[eco_views]
PH --> PV[pager_run_views]
GQL[GraphQL Query] --> REPO[Repository]
REPO --> EV
REPO --> PV
The outbox publisher polls for unpublished events every 100ms. When it finds them, it publishes to Watermill’s in-memory event bus. Projection handlers subscribe to specific event types and update the view tables:
- ECO projection — Listens for `ECOCreatedEvent`, `ECOUpdatedEvent`, `ECORenderIntegrationStatusUpdatedEvent`, and `ECOStatusUpdatedEvent`. Maintains the `eco_views` table with all the fields the UI needs: job number, status, render integration status, timestamps, tenant IDs.
- PagerRun projection — Listens for `PagerRunStartedEvent`, `PagerEventRecordedEvent`, and `PagerRunCompletedEvent`. Maintains the `pager_run_views` table with the full paging timeline, outcome, and error details.
When a user queries their ECO list, the GraphQL resolver calls the repository, which runs a SELECT against eco_views. No event replay, no aggregate loading — just a SQL query with appropriate indexes and tenant filters.
The separation of reads and writes means you can optimize each independently. The write model is normalized around aggregates and business invariants. The read model is denormalized around query patterns. If the UI needs a new view — say, “show me all ECOs where paging timed out” — you add a projection handler and a new view table. The write path doesn’t change at all.
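A projection handler is mechanically simple: consume an event, upsert a row. The sketch below uses an in-memory map as a stand-in for the `eco_views` table, and the field set is illustrative — the real handler runs an SQL upsert against PostgreSQL:

```go
package main

import "fmt"

// Simplified event shape; the real ECOCreatedEvent carries more fields.
type ECOCreatedEvent struct {
	ID        string
	JobNumber int
}

type ECOView struct {
	ID        string
	JobNumber int
	Status    string
}

type ECOProjection struct {
	views map[string]ECOView // in-memory stand-in for the eco_views table
}

// Handle upserts the view row. Because the write keys on the aggregate ID,
// a duplicate delivery rewrites the same row — the handler is idempotent,
// which is what makes at-least-once delivery safe.
func (p *ECOProjection) Handle(evt ECOCreatedEvent) {
	p.views[evt.ID] = ECOView{ID: evt.ID, JobNumber: evt.JobNumber, Status: "OPEN"}
}

func main() {
	p := &ECOProjection{views: map[string]ECOView{}}
	evt := ECOCreatedEvent{ID: "eco-1", JobNumber: 7}
	p.Handle(evt)
	p.Handle(evt) // at-least-once delivery: the duplicate changes nothing
	fmt.Println(len(p.views))
}
```

Adding a new view ("all ECOs where paging timed out") means writing another handler like this one and a matching table — the write path never knows it exists.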
The gap between writing an event and seeing it in a query is typically under 100ms. In practice, it’s imperceptible. But the decoupling is architecturally significant: if a projection handler crashes, events aren’t lost — they’re still in the outbox, waiting to be published. You fix the handler, redeploy, and the projection catches up. This is eventual consistency in its most pragmatic form. The system never lies — it might briefly be behind, but it will always converge. And the event log is the ground truth that makes convergence provable.
Aggregates: ECO and PagerRun
Aggregates are the enforcement boundary for business rules. In Strike, there are two aggregate roots: ECO (Emergency Callout Order) and PagerRun (a single pager dispatch attempt). Each aggregate is loaded by replaying its events, validates commands against its current state, and emits new events if the command is valid.
Here’s the aggregate interface pattern:
type Aggregate struct {
state *State // Current computed state
version int // Number of events applied
}
// Handle validates a command and returns events
func (agg *Aggregate) Handle(cmd any) ([]any, error) { ... }
// Apply updates state from an event (called during replay)
func (agg *Aggregate) Apply(event any) error { ... }
The ECO aggregate enforces invariants like: you can’t update a completed ECO, job types must be CMR/MFR/FNOL, render status transitions must follow a valid state machine (PENDING → TASKED → DISPATCHED, with FAILED reachable from any state).
stateDiagram-v2
[*] --> Created: ECOCreatedEvent
Created --> RenderTasked: RenderStatusUpdated(TASKED)
RenderTasked --> RenderDispatched: RenderStatusUpdated(DISPATCHED)
Created --> RenderFailed: RenderStatusUpdated(FAILED)
RenderTasked --> RenderFailed: RenderStatusUpdated(FAILED)
Created --> Updated: ECOUpdatedEvent
Updated --> Updated: ECOUpdatedEvent
Updated --> StatusCompleted: StatusUpdated(COMPLETED)
Created --> StatusCompleted: StatusUpdated(COMPLETED)
RenderDispatched --> StatusCompleted: StatusUpdated(COMPLETED)
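The "can't update a completed ECO" invariant, and the rule that `Handle` produces events without mutating state, can be sketched like this. The command and event shapes are simplified stand-ins, not Strike's actual definitions:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative command/event shapes for one invariant.
type UpdateECOCommand struct{ Description string }
type ECOUpdatedEvent struct{ Description string }

var ErrECOCompleted = errors.New("cannot update a completed ECO")

type ECOAggregate struct {
	status  string
	version int
}

// Handle validates against current state and returns events on success.
// Note: it never touches a.status — state changes only via Apply.
func (a *ECOAggregate) Handle(cmd UpdateECOCommand) ([]any, error) {
	if a.status == "COMPLETED" {
		return nil, ErrECOCompleted
	}
	return []any{ECOUpdatedEvent{Description: cmd.Description}}, nil
}

// Apply folds an event into state; it runs during replay.
func (a *ECOAggregate) Apply(event any) {
	if _, ok := event.(ECOUpdatedEvent); ok {
		a.version++
	}
}

func main() {
	agg := &ECOAggregate{status: "COMPLETED"}
	if _, err := agg.Handle(UpdateECOCommand{Description: "late edit"}); err != nil {
		fmt.Println(err)
	}
}
```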
Optimistic concurrency prevents conflicting writes. Every event in the store has a version number within its aggregate, and a UNIQUE(aggregate_id, version) constraint in PostgreSQL. If two requests load the same aggregate at version 2 and both try to write version 3, one succeeds and the other gets a version conflict error. The failing request must reload and retry.
// From outbox/store.go — the version check
currentVersion, err := store.getCurrentVersion(ctx, tx, aggregateID)
if err != nil {
    return err
}
if currentVersion != expectedVersion {
    return eventstore.ErrVersionConflict
}
The aggregate is the consistency boundary. Within an aggregate, state transitions are atomic and validated. Between aggregates, coordination happens through events and process managers — never through shared mutable state.
Events themselves are simple Go structs with JSON serialization. Every event carries an ActorID for audit trail purposes and a Timestamp for temporal ordering:
type CreatedEvent struct {
ID string `json:"id"`
ActorID string `json:"actor_id"`
NocTenantID string `json:"noc_tenant_id"`
OspTenantID string `json:"osp_tenant_id"`
JobNumber int `json:"job_number"`
Geometry Geometry `json:"geometry"`
JobType string `json:"job_type"`
Description string `json:"description"`
Attachments []string `json:"attachments"`
Timestamp int64 `json:"timestamp"`
}
The Transactional Outbox
Here’s a problem that looks simple until you think about failure modes. You need to save an event and publish it to the message bus. Two operations. What if the first succeeds and the second fails?
flowchart TD
CMD[Command Handler] --> TX{Begin Transaction}
TX --> INS1[INSERT into events]
TX --> INS2[INSERT into event_outbox]
TX --> COMMIT[COMMIT]
POLL[Background Publisher<br/>polls every 100ms] --> FETCH[SELECT FROM event_outbox<br/>WHERE published_at IS NULL]
FETCH --> PUB[Publish to Watermill]
PUB --> MARK[SET published_at = NOW]
PUB -->|failure| INC[Increment attempts]
INC -->|max attempts| DLQ[Move to dead letter queue]
If you save the event to the database and then publish to the message bus, and the publish fails, you have an event that exists but was never delivered. Subscribers miss it. The projection never updates. The pager never fires.
The transactional outbox pattern solves this by writing both the event and a copy of it in an outbox table within the same database transaction. Both writes succeed or both fail — PostgreSQL’s ACID guarantees handle this. A background publisher then polls the outbox for unpublished entries and delivers them to the Watermill event bus.
// From outbox/store.go — the atomic write
func (store *EventStore) Append(ctx context.Context,
    aggregateID uuid.UUID, expectedVersion int,
    events []eventstore.Event) error {
    tx, err := store.db.Begin(ctx)
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx)
    // Version check (optimistic concurrency)
    currentVersion, err := store.getCurrentVersion(ctx, tx, aggregateID)
    if err != nil {
        return err
    }
    if currentVersion != expectedVersion {
        return eventstore.ErrVersionConflict
    }
    for i, evt := range events {
        // Each appended event takes the next version number in the aggregate
        eventID, err := store.insertEvent(ctx, tx, aggregateID, expectedVersion+i+1, evt)
        if err != nil {
            return err
        }
        if err := store.insertOutboxEntry(ctx, tx, eventID, aggregateID, evt); err != nil {
            return err
        }
    }
    return tx.Commit(ctx)
}
The delivery guarantee is at-least-once. If the publisher crashes after publishing but before marking the entry as published, the entry will be published again on restart. This means handlers must be idempotent — they need to tolerate seeing the same event twice without producing incorrect results.
The outbox table also has a dead letter queue (event_outbox_dlq). After a configurable number of failed publish attempts, entries are moved to the DLQ for manual review rather than blocking subsequent events. This prevents a single poison message from stalling the entire pipeline.
At-least-once delivery means every event handler must be idempotent. The pager process manager checks for existing state before starting a new run. The ECO projection uses upserts. If you add a new handler, design it to handle duplicate delivery from day one.
Process Managers
Some workflows span multiple aggregates and involve external systems with unpredictable timing. The pager dispatch flow is the canonical example: send a page via Twilio, wait for a webhook callback, handle timeouts, retry the next contact. This is too complex for a single aggregate — it requires a process manager.
Strike’s process managers use PostgreSQL state tables and background workers instead of distributed saga frameworks. The trade-off is deliberate: you lose built-in compensation orchestration, but you gain debuggability. When paging fails at 3 AM, SELECT * FROM pager_process_state WHERE eco_id = '...' tells you exactly where the workflow stopped and why.
The pager process manager’s lifecycle:
stateDiagram-v2
[*] --> PENDING: ECO becomes TASKED
PENDING --> RUNNING: StartPagerRunCommand sent
RUNNING --> RUNNING: Activity extends deadline
RUNNING --> COMPLETED: Contractor ACCEPTED
RUNNING --> COMPLETED: All contacts EXHAUSTED
RUNNING --> COMPLETED: Deadline exceeded (TIMEOUT)
The process manager listens for `RenderIntegrationStatusUpdatedEvent`. When an ECO reaches TASKED status, it:
- Resolves the geographic region from the ECO’s geometry
- Loads the contact list for that region (from the OSP tenant’s pager lists)
- Creates a `pager_process_state` record with status PENDING and a deadline
- Sends a `StartPagerRunCommand` to the PagerRun aggregate
- Calls the Twilio API to initiate the call flow
type State struct {
PagerRunID string
ECOID string
Status ProcessStatus // PENDING → RUNNING → COMPLETED
StartedAt time.Time
DeadlineAt time.Time // Background worker checks this
LastEventAt time.Time
Version int // Optimistic locking
}
A background worker polls every 5 seconds for overdue processes (WHERE status = 'RUNNING' AND deadline_at < NOW()). When a deadline is exceeded, the worker marks the run as failed. Each webhook callback from Twilio extends the deadline — so active calls keep the process alive.
Process manager state uses optimistic locking with a version column, just like aggregates. If a webhook callback and the deadline worker both try to update the same process simultaneously, one gets a version conflict and must retry. This prevents lost updates without database-level row locks.
Idempotency is baked in at every entry point. The handler for RenderIntegrationStatusUpdatedEvent checks for an existing pager run before starting a new one — so if Watermill delivers the event twice, the second delivery is a no-op:
existingState, err := proc.stateRepo.FindByECOID(ctx, evt.ID)
if err == nil && existingState != nil {
return nil // Already processing — skip duplicate
}
Snapshots and Upcasters
Two mechanical concerns become important as an event-sourced system ages: performance degradation from long event streams, and schema evolution when event structures change.
Snapshots solve the performance problem. Without them, loading an aggregate means replaying every event from the beginning. For a long-lived ECO with dozens of status transitions, that’s a lot of events. Strike uses an EventCountPolicy with a threshold of 50: after every 50 new events, the aggregate’s current state is serialized and saved as a snapshot. On next load, the system loads the snapshot and replays only events after it.
type EventCountPolicy struct {
Threshold int // Default: 50
}
func (p *EventCountPolicy) ShouldSnapshot(currentVersion, snapshotVersion int) bool {
return (currentVersion - snapshotVersion) >= p.Threshold
}
The snapshot store uses a simple snapshots table with UPSERT semantics — each aggregate has at most one snapshot, and it’s replaced when a new one is created. The Snapshotable interface requires two methods: CreateSnapshot() to serialize state and ApplySnapshot() to restore it.
At Strike’s current scale, snapshots are mostly insurance. ECO aggregates typically accumulate fewer than 20 events. But the infrastructure is wired and tested, so when aggregates do get long (process managers with many retries, for instance), the performance cliff is already handled.
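The load-with-snapshot path looks roughly like this. State is an `int` for brevity — the real code restores typed state via `ApplySnapshot` and replays typed events:

```go
package main

import "fmt"

// Minimal snapshot and event shapes for the sketch.
type Snapshot struct {
	Version int
	State   int // stands in for serialized aggregate state
}

type Event struct {
	Version int
	Delta   int
}

// load restores the snapshot (if any), then replays only the events
// recorded after the snapshot's version — the optimization snapshots
// exist to provide.
func load(snap *Snapshot, events []Event) (state, version int) {
	if snap != nil {
		state, version = snap.State, snap.Version
	}
	for _, e := range events {
		if e.Version <= version {
			continue // already folded into the snapshot
		}
		state += e.Delta
		version = e.Version
	}
	return state, version
}

func main() {
	snap := &Snapshot{Version: 50, State: 100}
	events := []Event{{Version: 50, Delta: 1}, {Version: 51, Delta: 2}, {Version: 52, Delta: 3}}
	state, version := load(snap, events)
	fmt.Println(state, version) // only versions 51 and 52 are replayed
}
```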
Upcasters solve schema evolution. When the structure of an event changes — say, CreatedEvent gains a new required field — old events in the store still have the old schema. Rather than migrating historical data, Strike uses an UpcasterRegistry that transforms event payloads from old schema versions to the current version during event loading.
type Upcaster interface {
FromVersion() int
ToVersion() int
EventType() string
Upcast(payload []byte) ([]byte, error)
}
Upcasters chain automatically. If you have upcasters for v1→v2 and v2→v3, loading a v1 event runs both transformations in sequence. Every event in the store carries an `event_schema_version` field, and the registry uses it to determine which transformations to apply.
Events are immutable, but their interpretation evolves. Upcasters are the mechanism that lets the code move forward without rewriting history. The event store remains the source of truth; upcasters are a read-time lens that translates old facts into the current language.
The Phase 2 Escape Hatch
Here’s the part most architecture documents leave out: the documented plan for when this might be too much.
The architecture simplification roadmap defines a Phase 2 option: simplify from CQRS/Event Sourcing to straightforward CRUD. Not because CQRS is wrong, but because the team acknowledged — in writing — that the complexity cost might exceed the benefit for a pre-product-market-fit application with a small team.
The decision framework is explicit:
| Trigger | Action |
|---|---|
| New developers take >3 weeks to onboard | Revisit Phase 2 |
| CQRS-related bugs exceed 20% of total issues | Revisit Phase 2 |
| Product direction shifts significantly | Revisit Phase 2 |
| Team is comfortable, bugs are normal | Keep CQRS |
| Event replay or time travel gets used | Keep CQRS |
What you’d lose: the complete event stream (replaced by an audit_log table with PostgreSQL triggers), temporal queries, event replay for rebuilding read models, and the ability to add new projections without touching the write path.
What you’d gain: 54% code reduction (from ~11,900 to ~5,500 lines), strong consistency instead of eventual consistency, halved onboarding time (1–2 weeks instead of 2–3), and a dramatically simpler mental model — standard Go services with SQL queries. The Phase 2 design is fully specified, including database schema, service layer code, and migration steps. It’s not a hand-wave — it’s a concrete plan that can be executed in weeks, not months. The architecture simplification roadmap includes working code samples for every major component.
The current recommendation, after two rounds of architecture review: defer Phase 2. The Go/Watermill port achieved the primary operational goals — no Axon Server, 10x memory reduction, mainstream language. The CQRS patterns are well-documented, the test coverage is strong (86%), and the cognitive load concern is mitigated by the architecture guide and cairns like this one.
The escape hatch is not a criticism of the current architecture. It’s an acknowledgment that architectural decisions exist in context — team size, product maturity, hiring market — and that context changes. Documenting the off-ramp is a sign of engineering maturity, not architectural doubt.
The optionality is preserved because of a Watermill design decision: the message bus backend is pluggable. If the team decides to simplify to CRUD, the migration path is incremental — run both systems in parallel, migrate service by service, then remove the CQRS infrastructure. If the team decides to scale up, the same pluggability lets you swap PostgreSQL polling for Kafka or NATS without rewriting application code.
Summary
- Strike uses event sourcing — state changes are stored as immutable events in an append-only PostgreSQL table. Current state is derived by replaying events through aggregates. This provides a complete audit trail, temporal queries, and event replay at the cost of increased architectural complexity.
- The system migrated from Kotlin/Axon to Go/Watermill because ~70% of Axon's capabilities went unused, JVM overhead was significant, and the Kotlin/Axon hiring pool was too narrow for a small team handoff. The migration preserved the patterns while replacing the framework.
- CQRS separates writes (commands → aggregates → events → outbox) from reads (projections → view tables → queries). The transactional outbox guarantees at-least-once delivery by writing events and outbox entries in the same database transaction, with a background publisher polling every 100ms.
- ECO and PagerRun are aggregate roots that enforce business invariants and use optimistic concurrency (version numbers + unique constraints) to prevent conflicting writes. Aggregates produce events but never mutate state during command handling.
- Process managers coordinate multi-step workflows using PostgreSQL state tables and background workers instead of distributed saga frameworks. The trade-off: manual idempotency and compensation logic in exchange for direct SQL debuggability.
- Snapshots (50-event threshold) optimize aggregate loading for long-lived entities. Upcasters handle event schema evolution by transforming old payloads to the current format at read time, without modifying historical data.
- A documented Phase 2 escape hatch exists to simplify to CRUD if CQRS proves too costly. The decision framework specifies concrete triggers for revisiting. Current recommendation: defer — the Go port achieved operational goals and the patterns are well-documented.
Discussion Prompts
- The transactional outbox polls at 100ms intervals. What failure scenarios does this protect against that an in-process event publish wouldn't? When would you accept the simpler approach?
- The Phase 2 escape hatch replaces the event stream with a PostgreSQL audit trigger. What information would you lose in that translation, and would it matter for this domain?
- Process managers use PostgreSQL state tables instead of a saga framework. At what team size or system complexity would you reconsider that trade-off?
References
- Martin Fowler: Event Sourcing — Canonical reference for the event sourcing pattern and its trade-offs.
- Martin Fowler: CQRS — Command Query Responsibility Segregation explained in its original context.
- Watermill — Go library for event-driven applications, providing the command bus, event bus, and CQRS infrastructure used in Strike.
- Transactional Outbox Pattern — Chris Richardson's pattern description for reliable event publishing without distributed transactions.
- Event Store: Event Sourcing Explained — Practical introduction to event sourcing from the team behind EventStoreDB.
- ADR-001: Backend Technology Stack Migration — Internal decision record documenting the Kotlin/Axon → Go/Watermill rationale and alternatives analysis.
- Architecture Simplification Roadmap — Internal document specifying Phase 1 (Go port) and Phase 2 (CRUD simplification) with decision framework and migration steps.
Generated by Cairns · Agent-powered with Claude