The Shape of the System
How Osprey Strike is structured, from mono-repo to multi-tenant · ~16 min read
Before you can understand why a system behaves the way it does, you need to see how it's put together. Osprey Strike isn't a single application — it's four packages in a mono-repo, connected by event streams and separated by tenant boundaries. This cairn maps the architecture from directory structure to request flow.
Four Packages, One Repository
Osprey Strike lives in a mono-repo with four packages, each with its own language, build system, and deployment artifact. They share a root justfile for orchestration, but they’re otherwise independent — you can build and test any package without touching the others.
```mermaid
graph TD
    Root["osprey-strike/"]
    Root --> API["packages/api/<br/>Go GraphQL API"]
    Root --> Web["packages/web/<br/>Next.js Frontend"]
    Root --> Mock["packages/mock/<br/>Go Mock Server"]
    Root --> MCP["packages/mcp/<br/>MCP Server"]
    API --- DB[("PostgreSQL 18<br/>+ PostGIS")]
    Mock -.->|simulates| RN[Render Networks]
    Mock -.->|simulates| TW[Twilio]
    Web -->|GraphQL| API
    API -->|webhooks| Mock
```
The packages break down like this:
packages/api/ is the core. A Go application built on Watermill for messaging and gqlgen for GraphQL. This is where event sourcing, CQRS, command handlers, process managers, and projections all live. If Strike has a heart, it’s here.
The API package alone contains the complete domain model, event store, transactional outbox, projection handlers, and all external integration logic. Later cairns in this trail will dissect each of these individually.
packages/web/ is the user-facing frontend. A Next.js application using TypeScript, React, Tailwind CSS, and HeroUI components. It provides the ECO management interface — creating callouts, monitoring status, switching between tenant contexts, and viewing geographic data on maps.
packages/mock/ is the development mock server. A Go application that simulates Render Networks and Twilio APIs for local development. It runs an orchestrated scenario system — you pick a scenario (happy path, render failure, pager timeout), and the mock server walks through the entire integration sequence with realistic timing.
The mock server replaces what used to be a Python script for Render simulation and a Kotlin mock class for Twilio. Unifying both into a single Go service enabled cross-integration scenario orchestration — something that wasn’t possible when the mocks were separate.
packages/mcp/ is an experimental AI integration prototype. A Model Context Protocol server that exposes Strike operations to AI agents. Still early — the other three packages are where the production system lives.
The mono-repo isn’t a monolith. Each package deploys independently, with its own Docker image and CI pipeline. The mono-repo structure buys you atomic cross-package changes and a single place to reason about the whole system.
How a Request Flows Through the System
Understanding Strike means understanding the path a request takes from a user clicking “Create ECO” to the data appearing on everyone’s screen. The system uses CQRS — Command Query Responsibility Segregation — which means writes and reads take fundamentally different paths.
Here’s the write path when a NOC operator creates an Emergency Callout:
```mermaid
sequenceDiagram
    participant Browser
    participant Web as Next.js
    participant GQL as GraphQL API
    participant CB as Command Bus
    participant Agg as ECO Aggregate
    participant ES as Event Store
    participant OB as Outbox Publisher
    participant Proj as Projection
    participant Views as eco_views
    Browser->>Web: Create ECO form submit
    Web->>GQL: mutation createECO(input)
    GQL->>CB: Send CreateECOCommand
    CB->>Agg: Route to ECO handler
    Agg->>Agg: Validate business rules
    Agg->>ES: Append ECOCreatedEvent
    Note over ES: events + outbox in<br/>single DB transaction
    ES-->>GQL: Command accepted
    GQL-->>Web: Return ECO ID
    Web-->>Browser: Navigate to ECO detail
    OB->>ES: Poll for unpublished events
    ES-->>OB: ECOCreatedEvent
    OB->>Proj: Publish to event bus
    Proj->>Views: INSERT into eco_views
    Note over Views: Read model now<br/>reflects the new ECO
```
The critical insight is the split between acknowledgment and projection. The browser gets a response as soon as the event is durably stored. The read model updates asynchronously — typically within 100ms, but the write path doesn’t wait for it. This is the “eventual” in eventual consistency, and it’s a deliberate architectural choice that buys scalability and reliability.

In practice, the latency between event storage and projection update is imperceptible to users. The outbox publisher polls every 100ms, and the projection handler runs in-process. But the decoupling matters for correctness — it guarantees events are never lost, even if the projection handler crashes.
The read path is simpler. Queries go directly to the projection tables — eco_views and pager_run_views — which are standard SQL tables optimized for the exact queries the UI needs. No event replay, no aggregate loading, just a SELECT.
CQRS (Command Query Responsibility Segregation): A pattern that separates the write model (commands → events) from the read model (projections → queries). Writes go through aggregates and the event store; reads go directly to denormalized view tables. The event bus connects the two sides.
The Two-Dimensional Tenant Problem
Most multi-tenant systems have one kind of tenant. Osprey Strike has two — and they interact. This is arguably the most structurally interesting problem in the system.
As we covered in When the Fiber Goes Dark, the ECO lifecycle involves two distinct organizations: the NOC (Network Operations Center) that detects the outage and creates the callout, and the OSP (Outside Plant) contractor that dispatches crews and performs the repair. These are different companies, with different users, different data, and different needs.
```mermaid
graph TD
    subgraph "NOC Tenants"
        N1[Metro NOC]
        N2[County NOC]
        N3[Rural NOC]
    end
    subgraph "OSP Tenants"
        O1[FieldWork OSP]
        O2[QuickFix OSP]
        O3[Regional OSP]
    end
    N1 -->|active link| O1
    N1 -->|active link| O2
    N2 -->|active link| O1
    N3 -->|active link| O3
    subgraph "OSP Resources"
        O1 --- R1[Render Instance]
        O1 --- P1[Pager Lists]
        O1 --- J1[Job Sequence]
        O2 --- R2[Render Instance]
        O2 --- P2[Pager Lists]
        O2 --- J2[Job Sequence]
    end
```
Each tenant type owns different things:
- NOC tenants create ECOs and need visibility into progress. They have Twilio subaccounts for pager phone numbers. They choose which OSP contractor to dispatch.
- OSP tenants receive dispatch and do the field work. They own Render API credentials, pager contact lists (organized by geographic region), and job number sequences.
The relationship between NOCs and OSPs is many-to-many. A NOC can work with multiple OSP contractors (dispatching to whoever covers that geography), and an OSP can serve multiple NOC clients. These relationships are tracked in a noc_osp_links table with their own lifecycle — pending, active, or suspended.
Every ECO has dual tenant context: a noc_tenant_id (who created it) and an osp_tenant_id (who works it). Both are required. This means ECO visibility rules aren’t a simple “show me my tenant’s data” — a NOC sees ECOs they created, an OSP sees ECOs assigned to them, and the same ECO appears in both views.
The data isolation model is row-level, enforced in the repository layer. Every query includes a tenant filter derived from the authenticated user’s active tenant context. Users can belong to multiple tenants (NOC and/or OSP), and they select which tenant they’re operating as via the UI — that selection flows through as an X-Tenant-ID header on every API request.
One subtle but important design choice: ECOs store direct tenant references rather than joining through the noc_osp_links table. This means if a NOC-OSP relationship is later suspended or severed, historical ECOs remain visible to both parties. The link table controls future ECO creation, not historical visibility.
External Integrations
Strike doesn’t operate in isolation. It connects to external systems for field task management, pager dispatch, and authentication. Each integration has its own architectural character.
Render Networks
Render Networks is the field task management platform that OSP contractors use to coordinate crews, track work orders, and manage field operations. Strike integrates with Render to push ECO details into the contractor’s existing workflow.
The integration is polling-based with fingerprint change detection. When an ECO is created and integrated with Render, Strike creates a project and tasks in the Render API. It then polls for status changes — not on a fixed timer, but using a fingerprint comparison that detects when task data has actually changed. This avoids hammering the Render API with redundant requests while still catching updates promptly. Each OSP tenant has its own Render API credentials, stored as envelope-encrypted secrets in the database. The polling service uses a client factory with TTL-based caching and singleflight deduplication — so concurrent requests for the same tenant’s client don’t stampede the token endpoint.
The polling service runs a configurable worker pool. Each Render instance (per OSP tenant) has its own polling interval, inactivity TTL, and grace period. When all tasks for an ECO reach a terminal state, polling enters a grace period before stopping entirely. Stale detection catches ECOs that go quiet without completing.
Twilio
Twilio handles pager dispatch — the automated process of calling through an OSP’s contact list to find someone who can respond to the emergency. Strike sends call requests to the Twilio API and receives webhook callbacks as the call progresses.
The pager workflow is coordinated by a process manager: a state machine backed by a PostgreSQL table (pager_process_state) with a background worker that checks for timeouts. When a pager run starts, it calls the first person on the list. If they don’t answer within 30 seconds, it moves to the next. If everyone declines or times out, the run is marked as exhausted.
The pager process manager deliberately avoids distributed saga frameworks. Instead, it uses a PostgreSQL state table that you can query with SELECT * FROM pager_process_state WHERE eco_id = '...'. When paging fails at 3 AM, you want psql, not framework log archaeology.
Authentication
Authentication follows a dual-mode pattern:
- Development: Keycloak runs as a Docker Compose service, pre-loaded with a strike-dev realm and test users. An OpenResty reverse proxy sits in front, performing OIDC authentication and injecting ALB-style headers — mimicking the production flow locally.
- Production: Cloudflare Access handles the first authentication gate, and an AWS ALB performs OIDC token validation. The API trusts the ALB-injected headers without running its own auth server.
This means the API code sees the same headers in both environments. The authentication boundary shifts, but the API’s interface to it doesn’t change.
Tech Stack at a Glance
Here’s the complete technology inventory — deliberately brief, because each component gets deeper treatment in later cairns.
| Layer | Technology | Role |
|---|---|---|
| API | Go 1.26 | Core application language |
| Messaging | Watermill | Command bus, event bus, in-memory message routing |
| GraphQL | gqlgen | Schema-first GraphQL code generation |
| Frontend | Next.js (TypeScript) | Server-side rendering, React UI |
| UI Components | HeroUI + Tailwind CSS | Component library and utility-first CSS |
| Database | PostgreSQL 18 + PostGIS | Event store, projections, geographic queries |
| Tracing | OpenTelemetry + Jaeger | Distributed tracing across services |
| Auth (dev) | Keycloak + OpenResty | Local OIDC with ALB header injection |
| Auth (prod) | Cloudflare Access + AWS ALB OIDC | Zero-trust edge auth + load balancer OIDC |
| Deployment | ArgoCD | GitOps continuous delivery to EKS |
| Infrastructure | OpenTofu | Infrastructure as code (Terraform fork) |
| Container | Docker + EKS | Local development and production hosting |
PostGIS is the geographic extension for PostgreSQL. Strike uses it for pager list regions — each pager list has a polygon geometry, and when an ECO is created, Strike queries ST_Contains(region_geometry, eco_point) to find the right contact list for that location.
Getting the System Running Locally
A developer’s first experience with a codebase is the setup process. Strike invests heavily in making that process reproducible and fast.
The toolchain starts with mise (a polyglot tool manager, think asdf but faster). The repo’s .mise.toml declares exact versions of Go, Node, just, direnv, and other tools. A single mise install gets the right versions of everything without polluting the system.
The combination of mise + direnv means environment setup is automatic. Enter the project directory, direnv loads .envrc which activates mise tools and sets service URLs. Leave the directory, and your shell goes back to normal. No global installs, no version conflicts.
just is the task runner — a modern alternative to make without the implicit rules and tab-sensitivity. The root justfile imports sub-modules for each package, giving you a hierarchical command structure:
```shell
just api dev      # Run Go API with hot reload
just web dev      # Run Next.js dev server
just mock dev     # Start mock server
just services up  # Start PostgreSQL and Keycloak via Docker Compose
just db fresh     # Drop, migrate, and seed the database
just check        # Run all quality gates across all packages
just doctor       # Verify your environment is correctly configured
```
Docker Compose manages infrastructure services with a profile system designed around the principle “start what you’re not touching”:
| Working on… | Command | Docker runs… |
|---|---|---|
| API | just services up | PostgreSQL, Keycloak |
| Frontend | just services up then just api dev | PostgreSQL, Keycloak |
| Mock server | just services up | PostgreSQL, Keycloak |
| Everything containerized | docker compose --profile all up | All services |
The mock server’s scenario system deserves special mention. Before testing an ECO flow, you select a preset scenario in the mock dashboard (happy path, render failure, pager timeout, second-accepts, etc.). The mock server then orchestrates the entire integration sequence — Render status transitions, Twilio webhook callbacks — with realistic timing. This replaces manual API manipulation for integration testing.
For the fastest path from clone to running system: mise trust && mise install, then just setup && just services up && just db fresh. In three separate terminals: just mock dev, just api dev, just web dev. The mock server must be running before the API starts — it provides the simulated Render and Twilio endpoints.
Summary
- Osprey Strike is a mono-repo with four packages: a Go GraphQL API (the core), a Next.js frontend, a Go mock server for integration simulation, and an experimental MCP server.
- Requests flow through a CQRS pipeline: commands go through aggregates and the event store; queries read from denormalized projection tables. The transactional outbox guarantees no events are lost between write and read sides.
- Multi-tenancy is two-dimensional — NOC tenants create ECOs, OSP tenants work them. The many-to-many relationship between NOCs and OSPs, combined with dual tenant context on every ECO, makes this the system's most structurally interesting challenge.
- External integrations connect Strike to Render Networks (polling-based field task management), Twilio (webhook-driven pager dispatch), and a dual-mode authentication stack that behaves identically in development and production.
- The development toolchain — mise, just, direnv, Docker Compose profiles, and the mock scenario system — prioritizes reproducible, fast setup for new developers.
Discussion Prompts
- The dual-tenant model adds complexity to every query and every access control check. What would a simpler model look like, and what would you lose?
- The transactional outbox polls every 100ms. Under what conditions would you want to replace polling with a notification mechanism, and what would you gain?
- The mock server uses preset scenarios rather than step-by-step manual control. When would you need finer-grained control, and how would you design it?
References
- Watermill — Go library for working with message streams, used for command bus, event bus, and CQRS infrastructure.
- gqlgen — Schema-first Go GraphQL server code generation.
- PostGIS — Geographic extension for PostgreSQL, used for pager list region matching.
- mise — Polyglot tool version manager (Go, Node, just, direnv, etc.).
- just — Command runner for project-specific recipes.
- OpenTelemetry — Observability framework for distributed tracing across Strike services.
- Martin Fowler: CQRS — Canonical reference for Command Query Responsibility Segregation.
Generated by Cairns · Agent-powered with Claude