The Shape of the System
How Osprey Strike is structured, from mono-repo to multi-tenant · ~16 min read
Before you can understand why a system behaves the way it does, you need to see how it's put together. Osprey Strike isn't a single application — it's four packages in a mono-repo, connected by event streams and separated by tenant boundaries. This cairn maps the architecture from directory structure to request flow.
Four Packages, One Repository
Osprey Strike lives in a mono-repo with four packages, each with its own language, build system, and deployment artifact. They share a root justfile for orchestration, but they’re otherwise independent — you can build and test any package without touching the others.
```mermaid
graph TD
    Root["osprey-strike/"]
    Root --> API["packages/api/<br/>Go GraphQL API"]
    Root --> Web["packages/web/<br/>Next.js Frontend"]
    Root --> Mock["packages/mock/<br/>Go Mock Server"]
    Root --> MCP["packages/mcp/<br/>MCP Server"]
    API --- DB[("PostgreSQL 18<br/>+ PostGIS")]
    Mock -.->|simulates| RN[Render Networks]
    Mock -.->|simulates| TW[Twilio]
    Web -->|GraphQL| API
    API -->|webhooks| Mock
```
The packages break down like this:
packages/api/ is the core. A Go application built on Watermill for messaging and gqlgen for GraphQL. This is where event sourcing, CQRS, command handlers, process managers, and projections all live. If Strike has a heart, it’s here.
The API package alone contains the complete domain model, event store, transactional outbox, projection handlers, and all external integration logic. Later cairns in this trail will dissect each of these individually.
packages/web/ is the user-facing frontend. A Next.js application using TypeScript, React, Tailwind CSS, and HeroUI components. It provides the ECO management interface — creating callouts, monitoring status, switching between tenant contexts, and viewing geographic data on maps.
packages/mock/ is the development mock server. A Go application that simulates Render Networks and Twilio APIs for local development. It runs an orchestrated scenario system — you pick a scenario (happy path, render failure, pager timeout), and the mock server walks through the entire integration sequence with realistic timing.
The mock server replaces what used to be a Python script for Render simulation and a Kotlin mock class for Twilio. Unifying both into a single Go service enabled cross-integration scenario orchestration — something that wasn’t possible when the mocks were separate.
packages/mcp/ is an experimental AI integration prototype. A Model Context Protocol server that exposes Strike operations to AI agents. Still early — the other three packages are where the production system lives.
The mono-repo isn’t a monolith. Each package deploys independently, with its own Docker image and CI pipeline. The mono-repo structure buys you atomic cross-package changes and a single place to reason about the whole system.
How a Request Flows Through the System
Understanding Strike means understanding the path a request takes from a user clicking “Create ECO” to the data appearing on everyone’s screen. The system uses CQRS — Command Query Responsibility Segregation — which means writes and reads take fundamentally different paths.
Here’s the write path when a NOC operator creates an Emergency Callout:
```mermaid
sequenceDiagram
    participant Browser
    participant Web as Next.js
    participant GQL as GraphQL API
    participant CB as Command Bus
    participant Agg as ECO Aggregate
    participant ES as Event Store
    participant OB as Outbox Publisher
    participant Proj as Projection
    participant Views as eco_views
    Browser->>Web: Create ECO form submit
    Web->>GQL: mutation createECO(input)
    GQL->>CB: Send CreateECOCommand
    CB->>Agg: Route to ECO handler
    Agg->>Agg: Validate business rules
    Agg->>ES: Append ECOCreatedEvent
    Note over ES: events + outbox in<br/>single DB transaction
    ES-->>GQL: Command accepted
    GQL-->>Web: Return ECO ID
    Web-->>Browser: Navigate to ECO detail
    OB->>ES: Poll for unpublished events
    ES-->>OB: ECOCreatedEvent
    OB->>Proj: Publish to event bus
    Proj->>Views: INSERT into eco_views
    Note over Views: Read model now<br/>reflects the new ECO
```
The critical insight is the split between acknowledgment and projection. The browser gets a response as soon as the event is durably stored. The read model updates asynchronously — typically within 100ms, but the write path doesn’t wait for it. This is the “eventual” in eventual consistency, and it’s a deliberate architectural choice that buys scalability and reliability.

In practice, the latency between event storage and projection update is imperceptible to users. The outbox publisher polls every 100ms, and the projection handler runs in-process. But the decoupling matters for correctness — it guarantees events are never lost, even if the projection handler crashes.
The read path is simpler. Queries go directly to the projection tables — eco_views and pager_run_views — which are standard SQL tables optimized for the exact queries the UI needs. No event replay, no aggregate loading, just a SELECT.
CQRS (Command Query Responsibility Segregation): A pattern that separates the write model (commands → events) from the read model (projections → queries). Writes go through aggregates and the event store; reads go directly to denormalized view tables. The event bus connects the two sides.
The Two-Dimensional Tenant Problem
Most multi-tenant systems have one kind of tenant. Osprey Strike has two — and they interact. This is arguably the most structurally interesting problem in the system.
As we covered in When the Fiber Goes Dark, the ECO lifecycle involves two distinct organizations: the NOC (Network Operations Center) that detects the outage and creates the callout, and the OSP (Outside Plant) contractor that dispatches crews and performs the repair. These are different companies, with different users, different data, and different needs.
```mermaid
graph TD
    subgraph "NOC Tenants"
        N1[Metro NOC]
        N2[County NOC]
        N3[Rural NOC]
    end
    subgraph "OSP Tenants"
        O1[FieldWork OSP]
        O2[QuickFix OSP]
        O3[Regional OSP]
    end
    N1 -->|active link| O1
    N1 -->|active link| O2
    N2 -->|active link| O1
    N3 -->|active link| O3
    subgraph "OSP Resources"
        O1 --- R1[Render Instance]
        O1 --- P1[Pager Lists]
        O1 --- J1[Job Sequence]
        O2 --- R2[Render Instance]
        O2 --- P2[Pager Lists]
        O2 --- J2[Job Sequence]
    end
```
Each tenant type owns different things:
- NOC tenants create ECOs and need visibility into progress. They have Twilio subaccounts for pager phone numbers. They choose which OSP contractor to dispatch.
- OSP tenants receive dispatch and do the field work. They own Render API credentials, pager contact lists (organized by geographic region), and job number sequences.
The relationship between NOCs and OSPs is many-to-many. A NOC can work with multiple OSP contractors (dispatching to whoever covers that geography), and an OSP can serve multiple NOC clients. These relationships are tracked in a noc_osp_links table with their own lifecycle — pending, active, or suspended.
Every ECO has dual tenant context: a noc_tenant_id (who created it) and an osp_tenant_id (who works it). Both are required. This means ECO visibility rules aren’t a simple “show me my tenant’s data” — a NOC sees ECOs they created, an OSP sees ECOs assigned to them, and the same ECO appears in both views.
The data isolation model is row-level, enforced in the repository layer. Every query includes a tenant filter derived from the authenticated user’s active tenant context. Users can belong to multiple tenants (NOC and/or OSP), and they select which tenant they’re operating as via the UI — that selection flows through as an X-Tenant-ID header on every API request.
One subtle but important design choice: ECOs store direct tenant references rather than joining through the noc_osp_links table. This means if a NOC-OSP relationship is later suspended or severed, historical ECOs remain visible to both parties. The link table controls future ECO creation, not historical visibility.
External Integrations
Strike doesn’t operate in isolation. It connects to external systems for field task management, pager dispatch, and authentication. Each integration has its own architectural character.
Render Networks
Render Networks is the field task management platform that OSP contractors use to coordinate crews, track work orders, and manage field operations. Strike integrates with Render to push ECO details into the contractor’s existing workflow.
The integration is polling-based with fingerprint change detection. When an ECO is created and integrated with Render, Strike creates a project and tasks in the Render API. It then polls for status changes — not on a fixed timer, but using a fingerprint comparison that detects when task data has actually changed. This avoids hammering the Render API with redundant requests while still catching updates promptly. Each OSP tenant has its own Render API credentials, stored as envelope-encrypted secrets in the database. The polling service uses a client factory with TTL-based caching and singleflight deduplication — so concurrent requests for the same tenant’s client don’t stampede the token endpoint.
The polling service runs a configurable worker pool. Each Render instance (per OSP tenant) has its own polling interval, inactivity TTL, and grace period. When all tasks for an ECO reach a terminal state, polling enters a grace period before stopping entirely. Stale detection catches ECOs that go quiet without completing.
Twilio
Twilio handles pager dispatch — the automated process of calling through an OSP’s contact list to find someone who can respond to the emergency. Strike sends call requests to the Twilio API and receives webhook callbacks as the call progresses.
The pager workflow is coordinated by a process manager: a state machine backed by a PostgreSQL table (pager_process_state) with a background worker that checks for timeouts. When a pager run starts, it calls the first person on the list. If they don’t answer within 30 seconds, it moves to the next. If everyone declines or times out, the run is marked as exhausted.
The pager process manager deliberately avoids distributed saga frameworks. Instead, it uses a PostgreSQL state table that you can query with SELECT * FROM pager_process_state WHERE eco_id = '...'. When paging fails at 3 AM, you want psql, not framework log archaeology.
Authentication
Authentication follows a dual-mode pattern:
- Development: Keycloak runs as a Docker Compose service, pre-loaded with a strike-dev realm and test users. An OpenResty reverse proxy sits in front, performing OIDC authentication and injecting ALB-style headers — mimicking the production flow locally.
- Production: Cloudflare Access handles the first authentication gate, and an AWS ALB performs OIDC token validation. The API trusts the ALB-injected headers without running its own auth server.
This means the API code sees the same headers in both environments. The authentication boundary shifts, but the API’s interface to it doesn’t change.
Tech Stack at a Glance
Here’s the complete technology inventory — deliberately brief, because each component gets deeper treatment in later cairns.
| Layer | Technology | Role |
|---|---|---|
| API | Go 1.26 | Core application language |
| Messaging | Watermill | Command bus, event bus, in-memory message routing |
| GraphQL | gqlgen | Schema-first GraphQL code generation |
| Frontend | Next.js (TypeScript) | Server-side rendering, React UI |
| UI Components | HeroUI + Tailwind CSS | Component library and utility-first CSS |
| Database | PostgreSQL 18 + PostGIS | Event store, projections, geographic queries |
| Tracing | OpenTelemetry + Jaeger | Distributed tracing across services |
| Auth (dev) | Keycloak + OpenResty | Local OIDC with ALB header injection |
| Auth (prod) | Cloudflare Access + AWS ALB OIDC | Zero-trust edge auth + load balancer OIDC |
| Deployment | ArgoCD | GitOps continuous delivery to EKS |
| Infrastructure | OpenTofu | Infrastructure as code (Terraform fork) |
| Container | Docker + EKS | Local development and production hosting |
PostGIS is the geographic extension for PostgreSQL. Strike uses it for pager list regions — each pager list has a polygon geometry, and when an ECO is created, Strike queries ST_Contains(region_geometry, eco_point) to find the right contact list for that location.
Getting the System Running Locally
A developer’s first experience with a codebase is the setup process. Strike invests heavily in making that process reproducible and fast.
The toolchain starts with mise (a polyglot tool manager, think asdf but faster). The repo’s .mise.toml declares exact versions of Go, Node, just, direnv, and other tools. A single mise install gets the right versions of everything without polluting the system.
The combination of mise + direnv means environment setup is automatic. Enter the project directory, direnv loads .envrc which activates mise tools and sets service URLs. Leave the directory, and your shell goes back to normal. No global installs, no version conflicts.
just is the task runner — a modern alternative to make without the implicit rules and tab-sensitivity. The root justfile imports sub-modules for each package, giving you a hierarchical command structure:
```shell
just api dev      # Run Go API with hot reload
just web dev      # Run Next.js dev server
just mock dev     # Start mock server
just services up  # Start PostgreSQL and Keycloak via Docker Compose
just db fresh     # Drop, migrate, and seed the database
just check        # Run all quality gates across all packages
just doctor       # Verify your environment is correctly configured
```
Docker Compose manages infrastructure services with a profile system designed around the principle “start what you’re not touching”:
| Working on… | Command | Docker runs… |
|---|---|---|
| API | just services up | PostgreSQL, Keycloak |
| Frontend | just services up then just api dev | PostgreSQL, Keycloak |
| Mock server | just services up | PostgreSQL, Keycloak |
| Everything containerized | docker compose --profile all up | All services |
The mock server’s scenario system deserves special mention. Before testing an ECO flow, you select a preset scenario in the mock dashboard (happy path, render failure, pager timeout, second-accepts, etc.). The mock server then orchestrates the entire integration sequence — Render status transitions, Twilio webhook callbacks — with realistic timing. This replaces manual API manipulation for integration testing.
For the fastest path from clone to running system: mise trust && mise install, then just setup && just services up && just db fresh. In three separate terminals: just mock dev, just api dev, just web dev. The mock server must be running before the API starts — it provides the simulated Render and Twilio endpoints.
Summary
- Osprey Strike is a mono-repo with four packages: a Go GraphQL API (the core), a Next.js frontend, a Go mock server for integration simulation, and an experimental MCP server.
- Requests flow through a CQRS pipeline: commands go through aggregates and the event store; queries read from denormalized projection tables. The transactional outbox guarantees no events are lost between write and read sides.
- Multi-tenancy is two-dimensional — NOC tenants create ECOs, OSP tenants work them. The many-to-many relationship between NOCs and OSPs, combined with dual tenant context on every ECO, makes this the system's most structurally interesting challenge.
- External integrations connect Strike to Render Networks (polling-based field task management), Twilio (webhook-driven pager dispatch), and a dual-mode authentication stack that behaves identically in development and production.
- The development toolchain — mise, just, direnv, Docker Compose profiles, and the mock scenario system — prioritizes reproducible, fast setup for new developers.
Discussion Prompts
- The dual-tenant model adds complexity to every query and every access control check. What would a simpler model look like, and what would you lose?
- The transactional outbox polls every 100ms. Under what conditions would you want to replace polling with a notification mechanism, and what would you gain?
- The mock server uses preset scenarios rather than step-by-step manual control. When would you need finer-grained control, and how would you design it?
References
- Watermill — Go library for working with message streams, used for command bus, event bus, and CQRS infrastructure.
- gqlgen — Schema-first Go GraphQL server code generation.
- PostGIS — Geographic extension for PostgreSQL, used for pager list region matching.
- mise — Polyglot tool version manager (Go, Node, just, direnv, etc.).
- just — Command runner for project-specific recipes.
- OpenTelemetry — Observability framework for distributed tracing across Strike services.
- Martin Fowler: CQRS — Canonical reference for Command Query Responsibility Segregation.
Generated by Cairns · Agent-powered with Claude