Running in Production
From Cloudflare edge to PostgreSQL — how Osprey Strike deploys, authenticates, and scales · ~16 min read · Suggested by Bob (engineerops)
Code that works on localhost is a hypothesis. Code running in production is evidence. The gap between the two is where most systems stumble — authentication that assumed a single user, deployments that required SSH and prayer, infrastructure that existed only in someone's head. This cairn traces Osprey Strike's production stack from Cloudflare's edge network to the PostgreSQL rows at the bottom, covering the four authentication layers that protect it, the GitOps pipeline that deploys it, and the cost model that keeps it running for less than a team lunch.
The Production Topology
Every request to Osprey Strike passes through four distinct infrastructure layers before it reaches the application. This isn’t accidental complexity — each layer serves a specific purpose that the others can’t fulfill. The stack runs in AWS, fronted by Cloudflare, and deploys to two environments: main (continuous integration, auto-deploys on merge) and demo (stakeholder demos, manually promoted).
```mermaid
graph TD
    User["User / Browser"]
    Twilio["Twilio Webhooks"]
    CF["Cloudflare<br/>WAF · CDN · DDoS"]
    CFA["Cloudflare Access<br/>Zero Trust"]
    CFW["Cloudflare Worker<br/>twilio-webhook-fwd"]
    ALB["AWS ALB<br/>OIDC Auth · SG Whitelist"]
    subgraph EKS["EKS Cluster"]
        direction TB
        subgraph NS_MAIN["osprey-main namespace"]
            API_M["Go API"]
            WEB_M["Next.js Web"]
            MOCK_M["Mock Server"]
            MCP_M["MCP Server"]
        end
        subgraph NS_DEMO["osprey-demo namespace"]
            API_D["Go API"]
            WEB_D["Next.js Web"]
        end
        ARGO["ArgoCD<br/>osprey-argocd namespace"]
    end
    RDS["RDS PostgreSQL 18<br/>+ PostGIS"]
    User --> CF
    Twilio --> CFW
    CFW --> CFA
    CF --> CFA
    CFA --> ALB
    ALB --> EKS
    API_M --> RDS
    API_D --> RDS
```
The critical design decision here is the security group whitelist. The ALB only accepts traffic from Cloudflare’s published IP ranges. Even if someone discovers the ALB’s DNS name, their connection is refused at the network level. This means Cloudflare’s WAF, DDoS protection, and Zero Trust policies can’t be bypassed — they’re enforced architecturally, not by hoping nobody tries the direct URL. Cloudflare publishes their IP ranges at cloudflare.com/ips/. An OpenTofu module uses a Lambda function to keep the ALB security group in sync automatically — Cloudflare adds a new range, the security group updates within hours.
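As a sketch of the first step that sync has to perform, the following Go snippet parses the response shape of Cloudflare's public IP-list endpoint (api.cloudflare.com/client/v4/ips) into the CIDR ranges a security group ingress rule would whitelist. The type and function names are illustrative, not the project's actual Lambda code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cfIPs mirrors the shape of Cloudflare's public IP-list endpoint,
// which the sync Lambda polls.
type cfIPs struct {
	Result struct {
		IPv4CIDRs []string `json:"ipv4_cidrs"`
		IPv6CIDRs []string `json:"ipv6_cidrs"`
	} `json:"result"`
	Success bool `json:"success"`
}

// parseCloudflareRanges extracts the CIDR blocks that would be written
// into the ALB security group's ingress rules.
func parseCloudflareRanges(body []byte) ([]string, error) {
	var resp cfIPs
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if !resp.Success {
		return nil, fmt.Errorf("cloudflare API reported failure")
	}
	return append(resp.Result.IPv4CIDRs, resp.Result.IPv6CIDRs...), nil
}

func main() {
	sample := []byte(`{"result":{"ipv4_cidrs":["173.245.48.0/20"],"ipv6_cidrs":["2400:cb00::/32"]},"success":true}`)
	ranges, _ := parseCloudflareRanges(sample)
	fmt.Println(ranges) // [173.245.48.0/20 2400:cb00::/32]
}
```

The real module would then diff these ranges against the security group's current rules and apply only the changes, rather than rewriting the whole rule set on every run.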
The two environments share a single EKS cluster, separated by Kubernetes namespaces. Each environment gets its own ALB (via AWS ALB Controller’s group.name annotation), its own RDS database, and its own Cloudflare Access policies. Main runs two API replicas for high availability; demo runs one.
The ALB security group whitelist is the linchpin of the network security model. Without it, every other layer is advisory. With it, you get defense in depth that doesn’t depend on application-level enforcement.
The Four Auth Layers
Four authentication layers sounds like the kind of thing an enterprise architect draws on a whiteboard to justify headcount. In this case, each layer exists because the layer before it can’t do what the next one requires.
```mermaid
graph TD
    REQ["Incoming Request"]
    L1["Layer 1: Cloudflare Access<br/>Identity verification at edge<br/>Google Workspace · OTP fallback"]
    L2["Layer 2: AWS ALB OIDC<br/>JWT validation · ES256<br/>Session cookie management"]
    L3["Layer 3: Keycloak via nginx<br/>Token exchange · User provisioning<br/>Groups and roles mapping"]
    L4["Layer 4: API Middleware<br/>Tenant context extraction<br/>RBAC enforcement · Sys admin check"]
    APP["Application Logic"]
    REQ --> L1
    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> APP
```
Layer 1: Cloudflare Access decides who can reach the application at all. It’s Zero Trust authentication at the edge — before the request even touches AWS. Users authenticate via Google Workspace. External stakeholders who aren’t in Google Workspace can use email OTP. Office and VPN IP ranges can bypass login entirely for developer convenience. The bypass rules are per-environment: main allows bypass from known office IPs, while demo always requires authentication — you don’t want a stakeholder demo accidentally landing on an unauthenticated view.
Layer 2: AWS ALB OIDC provides the session management that Cloudflare Access doesn’t. When a user’s request reaches the ALB, it performs an OIDC flow against Keycloak, creates an AWSALBAuthSessionCookie, and injects three headers into every request: x-amzn-oidc-data (a signed JWT with user claims), x-amzn-oidc-identity (the user’s email), and x-amzn-oidc-accesstoken. The application never sees raw passwords and never manages session cookies.
Layer 3: Keycloak (via nginx OIDC-RP) handles the actual identity management — user provisioning, group membership, role mapping. In production, it’s the OIDC provider that ALB authenticates against. In local development, nginx acts as an OIDC Relying Party that simulates the exact same header injection pattern, so the application code is identical in both environments.
Layer 4: API middleware extracts tenant context from the authenticated user and enforces RBAC. This is where “authenticated user” becomes “Bob from KCCI with access to tenants A and B but not C.” System administrators get elevated access based on Keycloak group membership, not database flags.
The ES256 Padding Story
AWS ALB signs its JWTs with ES256 — but it pads the base64url segments with trailing = characters, which violates RFC 7515. The Go golang-jwt library rejects padded tokens by default, so every request returns 401.
The fix is a single function call: jwt.WithPaddingAllowed(). But the obvious alternative — stripping the padding before validation — would break signature verification, because the ECDSA signature was computed over the original padded segments. Change the input, invalidate the signature.
Don’t strip base64url padding from ALB JWTs. The signature was computed over the padded content. Use jwt.WithPaddingAllowed() (golang-jwt v5.2.0+) to accept padding without altering the signing input.
Build-Tag Security
The Go API uses build tags to make authentication bypass impossible in production binaries:
```go
// config_secure.go (compiled into every non-dev binary)
//go:build !dev

// Auth is always enabled, always enforced.
// Environment variables are ignored.
```

```go
// config_dev.go (compiled only with -tags dev)
//go:build dev

// Auth respects ALB_SIGNATURE_VALIDATION_ENABLED
// and WEBHOOK_AUTH_ENABLED environment variables.
```
Even if CI/CD accidentally sets ALB_SIGNATURE_VALIDATION_ENABLED=false, a production binary compiled without the dev tag will ignore the variable and enforce authentication. Mock auth is compile-time excluded, not runtime toggled.
The Auth Evolution
This architecture didn’t arrive fully formed. The original plan was Keycloak self-hosted as the sole identity provider — a full IAM server deployed alongside the application. That worked but carried operational burden: another service to patch, monitor, and back up, plus its own PostgreSQL database.
The first simplification was nginx OIDC-RP for local development. Instead of running Keycloak on every developer’s machine and maintaining keycloak-js in the frontend, OpenResty (nginx with Lua) acts as an OIDC relying party that injects the same x-amzn-oidc-* headers the ALB does. The frontend code became identical across environments — zero auth code in the Next.js build.
This eliminated keycloak-js, the silent-check-sso iframe, session cookie management in Next.js, and the entire token-getter pattern. The cleanup removed more code than the nginx OIDC-RP configuration added.
The second simplification was Cloudflare Access as the perimeter layer. Rather than relying solely on ALB OIDC (which anyone could reach if they knew the ALB DNS name), Cloudflare Access provides identity verification at the edge — before the request reaches AWS at all. This is when the ALB security group whitelist became critical: it ensures the only path to the application goes through Cloudflare.
GitOps with ArgoCD
Nobody SSHes into production. Nobody runs kubectl apply. Merge to main, and the change is live. That’s the contract, and ArgoCD is the enforcer.
```mermaid
graph LR
    DEV["Developer"]
    GH["GitHub<br/>main branch"]
    CI["GitHub Actions<br/>Build · Test · Push"]
    ECR["AWS ECR<br/>Docker Registry"]
    IU["ArgoCD Image Updater<br/>Watches digest changes"]
    ARGO["ArgoCD<br/>GitOps Controller"]
    EKS["EKS Cluster<br/>Live Deployment"]
    DEV -->|merge PR| GH
    GH -->|triggers| CI
    CI -->|push image| ECR
    ECR -->|digest changed| IU
    IU -->|commit new digest| GH
    GH -->|manifest changed| ARGO
    ARGO -->|sync| EKS
```
The flow for main (automatic):
- A developer merges a PR to main.
- GitHub Actions runs tests, builds Docker images for each package (API, Web, Mock, MCP), and pushes them to ECR tagged as branch-main.
- ArgoCD Image Updater detects that the branch-main tag’s SHA digest has changed.
- Image Updater commits the updated digest to the Kustomize overlay in the repo.
- ArgoCD detects the manifest change and syncs it to the EKS cluster.
The flow for demo (manual): a developer creates a git tag or explicitly updates the demo overlay’s image tags. ArgoCD syncs the change. No automatic deployment — demo stability matters for stakeholder presentations.
Kustomize Overlays
Each component follows a base/overlays pattern:
```
infrastructure/kubernetes/
├── api/
│   ├── base/          # Shared: deployment, service, configmap
│   └── overlays/
│       ├── main/      # Namespace, replicas, ingress, env config
│       └── demo/      # Same structure, different values
├── web/
│   └── (same pattern)
├── mock/
│   └── (same pattern)
└── mcp/
    └── (same pattern)
```
The base defines the deployment shape (resource limits, health checks, ports). The overlays inject environment-specific configuration: namespace (osprey-main vs osprey-demo), replica count (2 vs 1), Render mode (MOCK vs SANDBOX), and ALB group names. Main runs with mock external integrations for development speed; demo runs against real Render API in sandbox mode.
ArgoCD Application resources aren’t static YAML. They’re generated dynamically by OpenTofu — a Cartesian product of environments × applications. Add a new component or environment, and the ArgoCD apps create themselves.
ArgoCD Dashboard Access
ArgoCD itself needs authentication for its dashboard. Rather than running another Keycloak client, it uses its bundled Dex connector with Google Workspace via SAML. Developers log in with their Google account. RBAC is applied based on email and group membership. No additional identity infrastructure required.
The ArgoCD Image Updater uses a digest strategy with mutable branch tags. CI always pushes to branch-main. Image Updater detects the tag’s SHA changed, commits the new digest, and ArgoCD syncs. This avoids the proliferation of image tags while maintaining an auditable git history of every deployment.
Infrastructure as Code
If you can’t recreate the infrastructure from scratch using a single command, you don’t have infrastructure as code — you have infrastructure as folklore. Osprey Strike’s cloud resources are defined in OpenTofu modules, organized by concern.
The OpenTofu modules break down by responsibility:
| Module | Creates |
|---|---|
| vpc | VPC, subnets, NAT gateway, route tables |
| eks | EKS cluster, node groups, IRSA roles |
| rds | PostgreSQL instances per environment |
| acm-certificates | Wildcard SSL certificates with DNS validation |
| argocd/setup | ArgoCD installation on the cluster |
| argocd/configuration | ArgoCD applications (dynamic from environments × components) |
| eso/setup | External Secrets Operator installation |
| eso/configuration | SecretStore and ExternalSecret resources |
| keycloak/* | OIDC clients, realm config, DNS records, groups mapper |
| cf-worker | Cloudflare Worker for Twilio webhook forwarding |
| cloudflare-ips | Lambda that syncs Cloudflare IP ranges to ALB security group |
| jumpbox | EC2 bastion for secure EKS access |
| github-oidc-provider | AWS OIDC provider for GitHub Actions (keyless CI auth) |
| prefix-list | Managed prefix lists for security group rules |
| s3-attachments | S3 bucket for file attachments |
Secrets don’t live in OpenTofu state or Kubernetes manifests. AWS Secrets Manager holds sensitive values (Render API credentials, encryption keys), and the External Secrets Operator (ESO) syncs them into Kubernetes Secrets. The API pods reference Kubernetes Secrets in their environment variables — they never know the secrets came from AWS.
The jumpbox provides secure access when you need to debug something in the cluster. It’s an EC2 instance in a private subnet with an SSH tunnel to the EKS API server. No direct EKS API exposure to the internet, no broad VPN access.

The GitHub OIDC provider deserves a mention. GitHub Actions authenticates to AWS using OIDC federation — no long-lived AWS credentials stored in GitHub Secrets. The IAM role trusts GitHub’s OIDC provider and scopes access by repository and branch.
External Secrets Operator (ESO): A Kubernetes operator that reads secrets from external providers (AWS Secrets Manager, HashiCorp Vault, etc.) and creates Kubernetes Secrets. It handles rotation automatically — update the secret in AWS, and ESO propagates the change without redeploying pods.
Horizontal Scalability
The system earns a B+ for horizontal scalability. That’s not a euphemism for “it doesn’t scale.” It means the core architecture is sound for multi-instance deployment, with specific, documented gaps that have known solutions.
What Works (A-Grade)
Stateless authentication. Every request carries its own proof of identity via JWT. No server-side sessions, no shared session store. The load balancer can use round-robin routing — any instance can handle any request.
Event sourcing with optimistic locking. Multiple API instances can safely process commands against the same aggregate. The database enforces uniqueness on (aggregate_id, version). If two instances try to write the same version concurrently, one gets ErrVersionConflict and retries. No distributed locks required.
Idempotent projections. Event handlers that build read models are safe to replay. Database-backed, deterministic, and independent. You can add instances without worrying about projection state divergence.
Transactional outbox. Events are appended to the outbox in the same database transaction as the aggregate write. A background publisher polls unpublished entries and publishes them to Watermill. If an instance crashes after commit, another instance picks up the unpublished entries. Guaranteed eventual delivery.
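A toy version of the pattern, with a mutex standing in for the database transaction (illustrative only; the real implementation writes both rows in one PostgreSQL transaction and publishes via Watermill):

```go
package main

import (
	"fmt"
	"sync"
)

// outboxRow models a row written in the same transaction as the aggregate's
// events; Published flips only after the broker accepts the message.
type outboxRow struct {
	Payload   string
	Published bool
}

// store simulates the shared database; the mutex stands in for the
// transactional boundary that makes event + outbox writes atomic.
type store struct {
	mu     sync.Mutex
	outbox []outboxRow
}

func (s *store) appendWithOutbox(payload string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// ...the aggregate/event write would happen here, same transaction...
	s.outbox = append(s.outbox, outboxRow{Payload: payload})
}

// drain plays the background publisher: fetch unpublished rows, publish,
// mark published. In PostgreSQL this poll could use
// SELECT ... WHERE published = false FOR UPDATE SKIP LOCKED so that
// multiple instances never claim the same rows.
func (s *store) drain(publish func(string)) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for i := range s.outbox {
		if !s.outbox[i].Published {
			publish(s.outbox[i].Payload)
			s.outbox[i].Published = true
			n++
		}
	}
	return n
}

func main() {
	s := &store{}
	s.appendWithOutbox("IncidentCreated")
	s.appendWithOutbox("UnitDispatched")
	fmt.Println(s.drain(func(p string) { fmt.Println("publish:", p) })) // 2
	fmt.Println(s.drain(func(p string) {}))                             // 0
}
```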
What Doesn’t Work Yet (The Gaps)
Render polling in-memory state. The Render integration service maintains fingerprints, task counts, and initialization state in sync.Map — process-local memory. With two instances polling the same ECO, each maintains independent state. Instance B sees “all tasks changed” because it doesn’t have Instance A’s fingerprint cache. The symptom: duplicate change events, spurious notifications.
This isn’t just a scaling problem — it’s a restart problem. Every deployment loses the fingerprint cache and triggers a burst of false change events for every active ECO.
WebSocket subscription broadcast. The subscription broker maintains in-memory maps of connected WebSocket clients. Instance A publishes a pager event; Instance B’s clients don’t receive it because they’re in a different process’s memory.
Outbox publisher duplicate processing. Multiple instances poll the same outbox table. Two instances may fetch the same unpublished events and both attempt to publish. Handlers are idempotent, so correctness isn’t affected — but it’s wasted work and confusing metrics.
The Scaling Roadmap
| Gap | Short-Term Fix | Long-Term Fix |
|---|---|---|
| Render polling state | Externalize to Redis | Extract as standalone microservice |
| WebSocket broadcast | ALB sticky sessions | Redis Pub/Sub via Watermill |
| Outbox duplicates | Accept (idempotent handlers) | FOR UPDATE SKIP LOCKED |
| Deadline worker races | Add lease mechanism | Same lease mechanism |
For a two-instance redundancy deployment, sticky sessions and idempotent-duplicate acceptance are fine. The Redis path becomes necessary at three or more instances — about $12/month for an ElastiCache cache.t3.micro.
The B+ isn’t aspirational — it’s an honest assessment with a clear upgrade path. Stateless auth, idempotent event processing, and optimistic locking are the hard parts, and they’re already done. The remaining gaps are state externalization problems with known solutions.
Cost Model
The entire production stack runs for approximately $250/month. For a system with two environments, four application components, managed Kubernetes, managed PostgreSQL, and edge security — that’s a reasonable number.
Here’s where the money goes:
| Component | Monthly Cost | Notes |
|---|---|---|
| EKS Control Plane | ~$73 | Fixed cost, doesn’t scale with traffic |
| EKS Nodes (t3.medium × 2) | ~$60 | Runs all pods across both environments |
| Application Load Balancers (× 2) | ~$40 | One per environment |
| RDS PostgreSQL (db.t4g.small × 2) | ~$60 | One per environment |
| ECR Storage | ~$5 | Docker image storage |
| Data Transfer | ~$10 | Outbound bandwidth |
| AWS Total | ~$248 | |
| Cloudflare (Free tier) | $0 | WAF, CDN, Access, Workers |
| Grand Total | ~$250 | |
The cost components scale differently. EKS control plane is fixed — you pay $73 whether you run one pod or a hundred. Node costs scale with compute needs (add nodes). ALB costs scale with traffic volume (connection hours + processed bytes). RDS scales with instance class (vertical) or read replicas (horizontal). Cloudflare’s free tier covers everything Strike currently needs.
The alternatives were considered and rejected:
- ECS Fargate + ALB (~$160-230/month): cheaper, but the team has EKS expertise, not ECS. The learning curve cost exceeds the monthly savings.
- App Runner + Amplify (~$115-180/month): cheapest, but you can’t enforce security group whitelists on ingress. The *.awsapprunner.com URLs are publicly accessible, which defeats the Cloudflare perimeter model.
- EKS + self-hosted Keycloak (~$280-320/month): Keycloak adds another pod, another database, and another service to patch and monitor. Cloudflare Access does the same job at the edge for $0.

The Cloudflare free tier is remarkably generous for this use case. WAF rules, DDoS protection, CDN caching, Zero Trust Access policies, Workers — all covered. The paid tier ($20/month for Pro) would add advanced WAF rules and analytics, but isn’t necessary yet.
If budget is a concern, the single largest fixed cost is the EKS control plane at $73/month. For a lower-traffic product, ECS Fargate would cut that to zero (you pay only for task runtime). The trade-off is rewriting Kubernetes manifests as ECS task definitions and learning a different deployment model.
Summary
- The production stack flows from Cloudflare (edge security, CDN, Zero Trust) through AWS ALB (OIDC auth, security group whitelist) to an EKS cluster running four application components across two namespaced environments, backed by RDS PostgreSQL with PostGIS.
- Four authentication layers provide defense in depth: Cloudflare Access (perimeter identity), ALB OIDC (session management), Keycloak via nginx OIDC-RP (user provisioning and groups), and API middleware (tenant context and RBAC). Build-tag security ensures mock auth is compile-time excluded from production binaries.
- GitOps via ArgoCD means deployments are git commits, not SSH sessions. Merge to main triggers image builds; ArgoCD Image Updater detects digest changes; ArgoCD syncs to the cluster. Demo deployments are manually promoted for stability.
- Infrastructure is defined in OpenTofu modules covering EKS, RDS, VPC, certificates, secrets, and Cloudflare Workers. External Secrets Operator bridges AWS Secrets Manager to Kubernetes. A jumpbox provides secure cluster access without exposing the EKS API.
- Horizontal scalability earns a B+ — stateless auth, event sourcing with optimistic locking, and idempotent projections are production-ready. Render polling state, WebSocket broadcast, and outbox duplicate processing are documented gaps with clear upgrade paths.
- The entire stack runs for ~$250/month, with Cloudflare's free tier handling edge security. The EKS control plane ($73) is the largest fixed cost; most other components scale with actual usage.
Discussion Prompts
- The four auth layers evolved incrementally — each was added to solve a specific problem. If you were designing from scratch today with Cloudflare Access available, would you still need Keycloak? What would you lose by removing it?
- The ALB security group whitelist is described as the "linchpin" of the security model. What happens if Cloudflare has an outage? Is there a break-glass path to access the application, and should there be?
- The Render polling service's in-memory state is both a scaling problem and a reliability problem (state lost on restart). Does this change the priority calculus — should it be fixed for single-instance reliability before worrying about multi-instance scaling?
References
- Cloudflare Access Documentation — Zero Trust authentication configuration and policy management.
- AWS ALB OIDC Authentication — How ALB handles OpenID Connect flows and header injection.
- ArgoCD User Management and Dex Configuration — Setting up SSO connectors for ArgoCD dashboard access.
- lua-resty-openidc — The OpenID Connect Relying Party library for nginx/OpenResty used in local development auth simulation.
- EKS Best Practices Guide — AWS reference for production EKS cluster operation.
- External Secrets Operator — Kubernetes operator for synchronizing secrets from external providers.
- Cloudflare IP Ranges — The published IP list used for ALB security group whitelisting.
Generated by Cairns · Agent-powered with Claude