What's Next
Where Osprey Strike stands today, what's coming, and the questions that still need answers · ~14 min read · Suggested by Bob
There's a gap between a system that works and a product that's finished. It's the gap where the interesting decisions live — the ones you can't make in a design document because they require evidence you don't have yet. Osprey Strike handles emergencies end-to-end, deploys to production via GitOps, and serves multiple tenants. It also has unanswered questions about auth boundaries, architectural simplification, and what "done" even means for a pre-PMF product. This final cairn maps the territory ahead.
What’s Built
As of March 2026, Osprey Strike handles the full emergency callout lifecycle. An NOC operator creates an ECO, the system integrates with Render Networks, dispatches pagers via Twilio, and tracks the investigation through resolution. This isn’t a prototype — it runs in production on AWS EKS, deploys via ArgoCD, and enforces authentication through four layers from Cloudflare’s edge to the application’s JWT validation.
The system as it exists today:
- Full ECO lifecycle — create, dispatch, investigate, resolve. Seven mock scenarios exercise every path, from successful dispatch to pager exhaustion.
- Multi-tenant architecture — NOC and OSP tenants coexist with strict data isolation. Tenant context flows through every layer, from the GraphQL resolver to the PostgreSQL row filter.
- Pager dispatch via Twilio — a process manager orchestrates the ring-down workflow with timeouts, retry logic, and webhook callbacks. The Twilio integration is designed for per-NOC subaccounts, with the secrets infrastructure already in place.
- Render Networks integration — polling-based synchronization detects task changes via fingerprint comparison and pushes updates to subscribers in real time.
- PostGIS regions — geographic region geometry drives pager list management. An ECO’s location determines which contacts get called, and the region boundaries are stored as PostGIS geometries.
- Tenant settings — dynamic configuration per tenant, including map zoom levels and display preferences, stored in a dedicated settings table and cached for performance.
- Production deployment — ArgoCD watches the repository, automatic image updates trigger deployments, and the infrastructure is defined in OpenTofu. Both environments together cost less to run than a team lunch.
The system works. The question isn’t whether it handles emergencies — it does. The question is which of the many possible next steps will create the most value with the least risk.
Recent Momentum
The git log tells a story of a codebase that’s actively tightening, not just expanding. The last few weeks have focused on operational polish — the kind of work that doesn’t make demo reels but makes systems reliable.
Tenant settings and map intelligence. The most visible recent feature is dynamic map zoom configuration. ECO creation, viewing, and list contexts each have independent zoom levels, configurable per tenant. This replaced hard-coded POINT_ZOOM constants with context-aware settings that survive deployment and reflect each NOC’s geographic reality — an urban NOC needs different defaults than one covering hundreds of miles of rural fiber.
The tenant settings work included fixing a race condition where the map component’s settings fetch and the form’s initial render competed with each other, causing a brief flicker on the ECO view page. The kind of bug that’s easy to reproduce, maddening to debug, and invisible once fixed.
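The context-aware zoom resolution described above reduces to a lookup with a fallback. A stdlib-only sketch, with assumed names throughout (`TenantSettings`, the three context constants, and the default value are illustrative, not the real schema):

```go
package main

import "fmt"

// ZoomContext distinguishes where the map is rendered; the three
// contexts mirror the ones described in the text.
type ZoomContext string

const (
	ZoomCreate ZoomContext = "eco_create"
	ZoomView   ZoomContext = "eco_view"
	ZoomList   ZoomContext = "eco_list"
)

// defaultZoom stands in for the old hard-coded POINT_ZOOM constant.
const defaultZoom = 12

// TenantSettings is a hypothetical shape for the per-tenant settings
// row; the real table and cache layer are more involved.
type TenantSettings struct {
	MapZoom map[ZoomContext]int
}

// ResolveZoom returns the tenant's configured zoom for a context,
// falling back to the global default when unset. Accepting a nil
// settings pointer is what avoids the fetch-vs-render race: the map
// renders with the default and re-renders once settings arrive.
func ResolveZoom(s *TenantSettings, ctx ZoomContext) int {
	if s != nil {
		if z, ok := s.MapZoom[ctx]; ok {
			return z
		}
	}
	return defaultZoom
}

func main() {
	rural := &TenantSettings{MapZoom: map[ZoomContext]int{ZoomView: 9}}
	fmt.Println(ResolveZoom(rural, ZoomView)) // 9: tenant override for viewing
	fmt.Println(ResolveZoom(rural, ZoomList)) // 12: no override, default
	fmt.Println(ResolveZoom(nil, ZoomCreate)) // 12: settings not loaded yet
}
```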
Bug fixes that matter. Two fixes in particular deserve mention. A form synchronization race condition caused intermittent state corruption during ECO creation — the kind of bug that appears once in twenty uses and makes operators distrust the system. A separate duplicate ECO prevention fix caught a navigation timing issue where post-submit redirects could trigger a second creation event. Both are gone now.
Observability. OpenTelemetry tracing was wired into both the API and web deployments, with Jaeger available locally and OTLP export configured per environment. When something goes wrong in production, the team can now trace a request from the Next.js frontend through the GraphQL layer to the database and back.
Build automation. Image updates now trigger automatically — when a new API or web image is built, a commit updates the deployment manifests, and ArgoCD picks up the change. No human touches a deployment file for routine releases.
Beads integration. The project management tooling was upgraded to Beads 0.61, switching from Dolt remotes to git backup sync. The old approach took 10+ seconds per operation; the new one completes in milliseconds. Small tool improvements compound — Bob and Noam coordinate asynchronously across different working hours, and every second of friction in their coordination tools is a second stolen from product work.
The Blockers
Four items stand between the current state and satisfying the multi-tenancy worktree’s merge criteria. They’re documented in WORKTREE_GOALS.md, and none of them are optional.
WebSocket authentication. GraphQL subscriptions don’t currently receive authenticated user context. The HTTP layer validates JWTs correctly — Cloudflare Access, ALB OIDC, and the Go middleware form a solid chain. But WebSocket connections negotiate their own auth via connectionParams, and that path isn’t wired to the same validation stack. Until it is, subscriptions can’t enforce tenant isolation.
The target architecture mirrors the HTTP path: the WebSocket handshake carries a Bearer token in connectionParams, the Go API validates it with the same JWT middleware, and the subscription resolver receives the same auth context as any query or mutation. Same code paths, different transport.
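The "same code paths, different transport" idea can be sketched as follows. This is a stdlib-only sketch under stated assumptions: `ValidateJWT` is a stub standing in for the existing HTTP middleware's validator, and the `AuthContext` fields and `Authorization` key inside connectionParams are illustrative, not the project's actual names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// AuthContext is a stand-in for whatever the HTTP middleware attaches
// to a request; field names here are assumptions.
type AuthContext struct {
	Subject string
	Tenant  string
}

// ValidateJWT stands in for the existing JWT validation function.
// The point of the target architecture is that the WebSocket path
// calls the same one. This stub accepts a single hard-coded token.
func ValidateJWT(token string) (*AuthContext, error) {
	if token == "valid-token" {
		return &AuthContext{Subject: "operator@noc", Tenant: "noc"}, nil
	}
	return nil, errors.New("invalid token")
}

// AuthFromConnectionParams pulls the Bearer token out of the
// connection_init payload and runs it through the shared validator,
// mirroring the Authorization header on the HTTP path.
func AuthFromConnectionParams(params map[string]any) (*AuthContext, error) {
	raw, _ := params["Authorization"].(string)
	token, ok := strings.CutPrefix(raw, "Bearer ")
	if !ok {
		return nil, errors.New("missing bearer token in connectionParams")
	}
	return ValidateJWT(token)
}

func main() {
	auth, err := AuthFromConnectionParams(map[string]any{"Authorization": "Bearer valid-token"})
	if err != nil {
		panic(err)
	}
	fmt.Println(auth.Tenant) // subscriptions now see the same tenant context as queries
}
```

Once the handshake produces the same `AuthContext` a query would get, tenant isolation in subscription resolvers falls out of the existing enforcement code rather than needing a parallel implementation.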
Mock auth removal. The development mock authentication system — persona-based switching without real credentials — still ships in production builds. The plan is a local Keycloak instance that mirrors the AWS ALB OIDC pattern, allowing the mock system to be removed entirely via build tags. Until Keycloak works locally, removing mock auth breaks the development workflow.
End-to-end ECO flow testing. The individual pieces work. The integrated flow — create ECO, see it appear via subscription, watch pager dispatch, observe Render integration — hasn’t been validated as a continuous sequence with authenticated user context. This is the difference between “the parts work” and “the system works.”
E2E Playwright infrastructure. The Playwright test framework needs to run against Keycloak, not mock auth. This is downstream of the Keycloak work — you can’t test authenticated flows without a real authentication provider — but it’s separately tracked because the test infrastructure itself needs setup beyond just having Keycloak available.
These four items are interleaved. WebSocket auth depends on understanding the JWT flow. Mock auth removal depends on Keycloak. E2E tests depend on both. The merge criteria document says “ALL required” — partial completion doesn’t unlock the merge.
The Phase 2 Question
Somewhere in the background, a larger architectural question waits patiently. The architecture simplification roadmap laid out a two-phase evolution: Phase 1 ports the system from Kotlin/Axon to Go/Watermill (done), and Phase 2 simplifies from CQRS/Event Sourcing to straightforward CRUD.
Phase 2 promises a 54% code reduction. No event store, no aggregates, no projections, no transactional outbox. Direct PostgreSQL operations, WebSocket hub for real-time updates, background workers for polling and timeouts. The estimated result: ~5,500 lines of Go instead of ~11,900. Onboarding time drops from 2–3 weeks to 1–2.
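The CRUD shape Phase 2 describes is compact enough to sketch: a handler does one direct write, then notifies subscribers through a hub. A stdlib-only sketch with illustrative names; the real roadmap's hub, SQL, and audit_log are not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a minimal sketch of the WebSocket hub the simplification
// roadmap proposes: after a direct database write, the handler
// broadcasts the changed record to subscribers instead of emitting
// a domain event into an event store.
type Hub struct {
	mu   sync.Mutex
	subs []chan string
}

func (h *Hub) Subscribe() <-chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, 8) // buffered so Broadcast never blocks in this sketch
	h.subs = append(h.subs, ch)
	return ch
}

func (h *Hub) Broadcast(msg string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		ch <- msg
	}
}

// UpdateECOStatus sketches the whole CRUD write path: one UPDATE,
// one broadcast. In the real system the first step would be a single
// `UPDATE ecos SET status = $1 WHERE id = $2` plus an audit_log row.
func UpdateECOStatus(h *Hub, ecoID, status string) {
	h.Broadcast(fmt.Sprintf("eco %s -> %s", ecoID, status))
}

func main() {
	h := &Hub{}
	events := h.Subscribe()
	UpdateECOStatus(h, "eco-42", "DISPATCHED")
	fmt.Println(<-events) // eco eco-42 -> DISPATCHED
}
```

Compare this to the aggregate/projection machinery it would replace: no event types, no replay, no outbox. That is where the claimed 54% reduction comes from, and also what is given up.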
The January review recommended deferring Phase 2, and the reasoning was sound: Phase 1 already eliminated the operational pain (Axon Server, JVM memory, startup times), the architecture earned an A- grade, and migration introduces regression risk. But the recommendation came with explicit triggers for revisiting the decision:
- If onboarding new developers takes more than three weeks
- If CQRS-related bugs exceed 20% of total issues
- If the product direction shifts significantly
Those triggers haven’t been measured yet because no new developers have onboarded and the bug database is too small for statistical claims. The decision framework is ready. The data isn’t.
CQRS (Command Query Responsibility Segregation) separates the system’s read and write models. Write operations go through aggregates that enforce business rules and emit events. Read operations query projections built from those events. The separation adds complexity but enables independent scaling and temporal queries. Whether that complexity pays for itself depends on the system’s actual usage patterns.
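The write/read split described above fits in a few dozen lines. This is a generic CQRS sketch, not Osprey Strike's actual aggregates or event types; all names are illustrative.

```go
package main

import "fmt"

// Event is what the write side emits; the read side never touches
// the aggregate directly, only events.
type Event struct {
	Type  string
	ECOID string
}

// ECOAggregate is the write model: it enforces business rules and
// emits events.
type ECOAggregate struct {
	ID     string
	Status string
}

// Dispatch is a command handler: validate against current state,
// then emit an event. It never writes the read model directly.
func (a *ECOAggregate) Dispatch() ([]Event, error) {
	if a.Status != "CREATED" {
		return nil, fmt.Errorf("cannot dispatch from %s", a.Status)
	}
	a.Status = "DISPATCHED"
	return []Event{{Type: "ECODispatched", ECOID: a.ID}}, nil
}

// Projection is the read model, built only from events. Because it
// is derived, it can be rebuilt by replaying the event history,
// which is what enables temporal queries.
type Projection struct {
	StatusByECO map[string]string
}

func (p *Projection) Apply(e Event) {
	if e.Type == "ECODispatched" {
		p.StatusByECO[e.ECOID] = "DISPATCHED"
	}
}

func main() {
	agg := &ECOAggregate{ID: "eco-7", Status: "CREATED"}
	events, err := agg.Dispatch()
	if err != nil {
		panic(err)
	}
	proj := &Projection{StatusByECO: map[string]string{}}
	for _, e := range events {
		proj.Apply(e)
	}
	fmt.Println(proj.StatusByECO["eco-7"]) // DISPATCHED
}
```

Even at this toy scale, the indirection is visible: one state change touches a command handler, an event type, and a projection. Multiply by every field and every transition and the complexity argument on both sides of the Phase 2 question becomes concrete.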
What matters here isn’t the answer — it’s that the question has a framework. When the PM hire arrives, when the team grows, when the product’s direction stabilizes or pivots, the decision tree is documented. The variables are named. The thresholds are set. That’s what good architecture decisions look like: not “we decided X forever,” but “we’ll decide X when we know Y.”
The Road Ahead
Product roadmaps for pre-PMF software are aspirational by nature. The themes below represent the directions the product could go, ordered roughly by proximity to the current codebase. Some are weeks away. Some are quarters.
OSP-side features. Today, Osprey Strike is overwhelmingly an NOC tool. The OSP contractor — the person who actually drives to the damaged fiber and fixes it — sees very little of the system. A contractor portal, a field technician mobile view, and job acceptance workflows would close the loop between “someone accepted the page” and “the repair is complete.” This is where the dual-tenant model pays off architecturally: the OSP tenant structure already exists, waiting for its UI.
Deeper Render integration. The current integration is unidirectional: Osprey Strike pushes ECOs to Render and polls for task changes. Bidirectional updates — where Render task completions automatically advance ECO status — would eliminate manual status tracking. The fingerprint-based polling mechanism is the foundation, but the business logic for “which Render status maps to which ECO transition” needs product definition.
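The fingerprint mechanism mentioned above amounts to hashing the fields that matter and comparing digests between polls. A stdlib-only sketch; the real integration fingerprints more fields than this, and the `RenderTask` shape is an assumption:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// RenderTask holds the fields the poller cares about in this sketch.
type RenderTask struct {
	ID       string
	Status   string
	Assignee string
}

// Fingerprint hashes the comparable fields in a fixed order with a
// separator, so two polls of an unchanged task produce the same
// digest and field-boundary ambiguity is avoided.
func Fingerprint(t RenderTask) string {
	sum := sha256.Sum256([]byte(t.ID + "\x00" + t.Status + "\x00" + t.Assignee))
	return hex.EncodeToString(sum[:])
}

// Changed is the poller's core test: recompute and compare against
// the fingerprint stored from the previous poll.
func Changed(stored string, current RenderTask) bool {
	return Fingerprint(current) != stored
}

func main() {
	task := RenderTask{ID: "t1", Status: "OPEN", Assignee: "crew-3"}
	stored := Fingerprint(task)
	task.Status = "CLOSED"
	fmt.Println(Changed(stored, task)) // true: push an update to subscribers
}
```

Bidirectional sync would hang off the `true` branch: instead of only notifying subscribers, a status mapping would decide whether the Render change should also advance the ECO.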
Reporting and analytics. The event store is an underused asset. Every ECO state transition, every pager call attempt, every Render task change is recorded with timestamps. Temporal queries — average time from ECO creation to dispatch, pager acceptance rate by region, mean time to repair by fiber type — are the kind of insights that would make Osprey Strike valuable beyond its operational function. The data is there. The query layer and visualization aren’t. If Phase 2 simplification happens, the event store disappears. But the audit_log table proposed in the simplification roadmap captures the same state transitions. The temporal query capability survives either path — the access pattern changes, not the data.
Per-NOC Twilio configuration. The design is already written. Each NOC tenant gets its own Twilio subaccount, pager mode (live or mock), and encrypted credentials stored via the existing secrets infrastructure. The tenant_secrets table, the pagerMode enum, the per-tenant URL resolution in the pager processor — it’s all specified. Implementation is a matter of executing the plan.
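The resolution step of that design can be sketched as a strict per-tenant lookup. This is a stdlib-only sketch under stated assumptions: `SecretsStore` is a plain map standing in for the encrypted tenant_secrets rows, and the struct fields beyond the pagerMode enum are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// PagerMode mirrors the pagerMode enum from the design doc.
type PagerMode string

const (
	PagerLive PagerMode = "live"
	PagerMock PagerMode = "mock"
)

// TenantTwilio is roughly what the pager processor needs per NOC;
// the real tenant_secrets rows hold encrypted credentials.
type TenantTwilio struct {
	Mode          PagerMode
	SubaccountSID string
	AuthToken     string
}

// SecretsStore stands in for the existing secrets infrastructure.
type SecretsStore map[string]TenantTwilio

// Resolve picks the tenant's Twilio config, refusing to fall back to
// a shared account: isolation failures should be loud, not silent.
func (s SecretsStore) Resolve(tenant string) (TenantTwilio, error) {
	cfg, ok := s[tenant]
	if !ok {
		return TenantTwilio{}, errors.New("no Twilio config for tenant " + tenant)
	}
	return cfg, nil
}

func main() {
	store := SecretsStore{
		"noc-east": {Mode: PagerLive, SubaccountSID: "AC-east-sub", AuthToken: "***"},
		"noc-demo": {Mode: PagerMock},
	}
	cfg, err := store.Resolve("noc-demo")
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Mode) // mock: the demo tenant never places real calls
}
```

The no-fallback rule is the design choice worth noticing: a missing config should stop dispatch for that tenant rather than quietly paging through another NOC's subaccount.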
Attachment support. ECOs currently have an attachments field in the schema that accepts JSON but has no upload pipeline behind it. The S3 infrastructure exists in the AWS account. Connecting the two — presigned upload URLs from the API, a file picker in the UI, thumbnails in the ECO detail view — is straightforward plumbing with no architectural unknowns.
Horizontal scaling. The current architecture uses Watermill’s in-memory pub/sub (gochannel), which means a single API instance handles all event routing. For the current traffic volume, this is correct — adding Redis or NATS would be premature complexity. But the Watermill abstraction means the swap is a configuration change, not an architecture change. When traffic warrants multiple API replicas processing events concurrently, the path is clear.
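The reason the swap is only a configuration change is that handlers publish through an interface, never through a concrete broker. Watermill provides this seam via its own publisher abstraction; the stdlib-only sketch below just illustrates the shape, with a hypothetical backend selector and an in-memory implementation standing in for gochannel:

```go
package main

import "fmt"

// Publisher is the seam: handlers depend on this, not on a broker.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// InMemoryPublisher is the gochannel analogue: correct for a single
// API replica, insufficient once events must cross processes.
type InMemoryPublisher struct {
	topics map[string][][]byte
}

func NewInMemoryPublisher() *InMemoryPublisher {
	return &InMemoryPublisher{topics: map[string][][]byte{}}
}

func (p *InMemoryPublisher) Publish(topic string, payload []byte) error {
	p.topics[topic] = append(p.topics[topic], payload)
	return nil
}

// NewPublisher is the configuration point: when traffic warrants
// multiple replicas, this is where a NATS- or Redis-backed
// implementation would be returned instead. The backend strings are
// assumptions; only the in-memory case is wired in this sketch.
func NewPublisher(backend string) (Publisher, error) {
	switch backend {
	case "", "memory":
		return NewInMemoryPublisher(), nil
	default:
		return nil, fmt.Errorf("backend %q not wired yet", backend)
	}
}

func main() {
	pub, err := NewPublisher("memory")
	if err != nil {
		panic(err)
	}
	if err := pub.Publish("eco.dispatched", []byte(`{"ecoId":"eco-42"}`)); err != nil {
		panic(err)
	}
	fmt.Println("published without a broker running")
}
```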
What This Trail Taught Us
This is the sixth cairn in a trail that started with a damaged fiber cable and ends with a product roadmap. Along the way, we documented an event-sourced architecture, a pager dispatch system, a multi-tenant data model, and a production deployment stack. The trail was written while the system was being built — not after, as retrospective documentation usually is.
That timing matters. Writing about the architecture while building it forced clarity on decisions that might otherwise have been made implicitly. The Phase 2 decision framework exists because writing about the Go port’s tradeoffs made the team articulate what “good enough” meant. The pager process manager’s timeout logic is well-understood because explaining it for the cairn revealed edge cases the tests didn’t cover.
Bob and Noam work different hours with minimal overlap. These cairns serve double duty: they’re knowledge artifacts for the broader team, and they’re asynchronous context transfer between two developers who rarely occupy the same meeting. Writing for a public audience forces a completeness that “I’ll explain it on our next call” never achieves.
The trail also surfaces its own gaps. Writing about tenant settings reveals that per-NOC Twilio configuration is designed but not built. Writing about the blockers makes their interdependencies visible in a way that a task list doesn’t. Writing about the roadmap forces the question: which of these themes is actually next?
That question is deliberately unanswered. This cairn is a conversation starter, not a conclusion. The team, the stakeholders, and the eventual PM will make the prioritization calls. The trail’s job is to make sure those calls are informed.
Which of the topics in this trail deserves its own cairn? The per-NOC Twilio design is a candidate — it’s fully specified and architecturally interesting. The CQRS decision framework could stand alone as a general-purpose guide to “when to simplify.” The OSP contractor portal would require product discovery that hasn’t happened yet, but writing the cairn might be how that discovery starts.
Summary
- Osprey Strike handles the full ECO lifecycle in production: create, dispatch, investigate, resolve — with multi-tenant isolation, Twilio pager dispatch, Render Networks integration, and PostGIS-driven region management.
- Recent development has focused on operational polish: tenant-specific map settings, race condition fixes, observability instrumentation, and build automation that removes humans from the deployment loop.
- Four interleaved blockers gate the multi-tenancy merge: WebSocket auth, mock auth removal via Keycloak, end-to-end flow testing, and Playwright E2E infrastructure. They must be resolved together.
- The Phase 2 CQRS simplification question has a documented decision framework with measurable triggers — but the measurements require events (new developer onboarding, sustained bug tracking) that haven't occurred yet.
- The product roadmap spans OSP-side features, bidirectional Render integration, analytics from the event store, per-NOC Twilio configuration, attachment support, and horizontal scaling — each at a different distance from the current codebase.
- Writing this trail while building the system created clarity that retrospective documentation can't replicate. The trail surfaces its own next questions.
Discussion Prompts
- Of the four merge blockers, which should the team attack first — and does the interdependency between them dictate a specific sequence, or can they be parallelized?
- The Phase 2 decision framework names onboarding time as a trigger. With a PM hire incoming and potential new developers after that, should we instrument onboarding now — measure how long it takes the PM to understand the CQRS architecture — even though they won't be writing code?
- Which product roadmap theme creates the most value for the least effort? Is it the per-NOC Twilio configuration (already designed), the OSP contractor portal (new product surface), or the reporting layer (monetization potential)?
References
- When the Fiber Goes Dark — Part 1 of this trail. The domain problem that drives everything else.
- The Shape of the System — Part 2. The architecture, including the Phase 2 simplification roadmap.
- Events All the Way Down — Part 3. The event-sourced foundation and why it was chosen.
- The ECO Lifecycle — Part 4. The state machine that governs emergency callouts.
- Running in Production — Part 5. The deployment stack, authentication layers, and cost model.
- Architecture Simplification Roadmap (internal) — The decision framework for Phase 2 CQRS simplification, including the trigger thresholds and estimated code reduction.
- Twilio Secrets Design (internal) — Per-NOC Twilio configuration design, including tenant_secrets schema and pagerMode enum.
- WORKTREE_GOALS.md (internal) — Multi-tenancy worktree merge criteria and current blocker status.
Generated by Cairns · Agent-powered with Claude