The Production Topology

Every request to Osprey Strike passes through four distinct infrastructure layers before it reaches the application. This isn’t accidental complexity — each layer serves a specific purpose that the others can’t fulfill. The stack runs in AWS, fronted by Cloudflare, and deploys to two environments: main (continuous integration, auto-deploys on merge) and demo (stakeholder demos, manually promoted).

graph TD
    User["User / Browser"]
    Twilio["Twilio Webhooks"]

    CF["Cloudflare
    WAF · CDN · DDoS"]
    CFA["Cloudflare Access
    Zero Trust"]
    CFW["Cloudflare Worker
    twilio-webhook-fwd"]

    ALB["AWS ALB
    OIDC Auth · SG Whitelist"]

    subgraph EKS["EKS Cluster"]
        direction TB
        subgraph NS_MAIN["osprey-main namespace"]
            API_M["Go API"]
            WEB_M["Next.js Web"]
            MOCK_M["Mock Server"]
            MCP_M["MCP Server"]
        end
        subgraph NS_DEMO["osprey-demo namespace"]
            API_D["Go API"]
            WEB_D["Next.js Web"]
        end
        ARGO["ArgoCD
        osprey-argocd namespace"]
    end

    RDS["RDS PostgreSQL 18
    + PostGIS"]

    User --> CF
    Twilio --> CFW
    CFW --> CFA
    CF --> CFA
    CFA --> ALB
    ALB --> EKS
    API_M --> RDS
    API_D --> RDS

The critical design decision here is the security group whitelist. The ALB only accepts traffic from Cloudflare’s published IP ranges. Even if someone discovers the ALB’s DNS name, their connection is refused at the network level. This means Cloudflare’s WAF, DDoS protection, and Zero Trust policies can’t be bypassed — they’re enforced architecturally, not by hoping nobody tries the direct URL. Cloudflare publishes their IP ranges at cloudflare.com/ips/. An OpenTofu module uses a Lambda function to keep the ALB security group in sync automatically — Cloudflare adds a new range, the security group updates within hours.
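A minimal OpenTofu sketch of the whitelist idea (resource and variable names are illustrative — the real module keeps the ranges current via a Lambda-synced prefix list rather than a static variable):

```hcl
# Illustrative only — the real module syncs ranges via a Lambda.
# Allow HTTPS to the ALB from Cloudflare's published ranges, nothing else.
resource "aws_security_group" "alb" {
  name   = "osprey-alb"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.cloudflare_ipv4_ranges # from cloudflare.com/ips/
  }
  # No 0.0.0.0/0 ingress: direct hits on the ALB DNS name are refused
  # at the network level, before TLS or HTTP ever happens.
}
```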

The two environments share a single EKS cluster, separated by Kubernetes namespaces. Each environment gets its own ALB (via AWS ALB Controller’s group.name annotation), its own RDS database, and its own Cloudflare Access policies. Main runs two API replicas for high availability; demo runs one.

Key Takeaway

The ALB security group whitelist is the linchpin of the network security model. Without it, every other layer is advisory. With it, you get defense in depth that doesn’t depend on application-level enforcement.

The Four Auth Layers

Four authentication layers sounds like the kind of thing an enterprise architect draws on a whiteboard to justify headcount. In this case, each layer exists to do something the one before it can't.

graph TD
    REQ["Incoming Request"]

    L1["Layer 1: Cloudflare Access
    Identity verification at edge
    Google Workspace · OTP fallback"]

    L2["Layer 2: AWS ALB OIDC
    JWT validation · ES256
    Session cookie management"]

    L3["Layer 3: Keycloak via nginx
    Token exchange · User provisioning
    Groups and roles mapping"]

    L4["Layer 4: API Middleware
    Tenant context extraction
    RBAC enforcement · Sys admin check"]

    APP["Application Logic"]

    REQ --> L1
    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> APP

Layer 1: Cloudflare Access decides who can reach the application at all. It’s Zero Trust authentication at the edge — before the request even touches AWS. Users authenticate via Google Workspace. External stakeholders who aren’t in Google Workspace can use email OTP. Office and VPN IP ranges can bypass login entirely for developer convenience. The bypass rules are per-environment. Main allows bypass from known office IPs. Demo always requires authentication — you don’t want a stakeholder demo to accidentally expose an unauthenticated view.

Layer 2: AWS ALB OIDC provides the session management that Cloudflare Access doesn’t. When a user’s request reaches the ALB, it performs an OIDC flow against Keycloak, creates an AWSALBAuthSessionCookie, and injects three headers into every request: x-amzn-oidc-data (a signed JWT with user claims), x-amzn-oidc-identity (the user’s email), and x-amzn-oidc-accesstoken. The application never sees raw passwords and never manages session cookies.

Layer 3: Keycloak (via nginx OIDC-RP) handles the actual identity management — user provisioning, group membership, role mapping. In production, it’s the OIDC provider that ALB authenticates against. In local development, nginx acts as an OIDC Relying Party that simulates the exact same header injection pattern, so the application code is identical in both environments.

Layer 4: API middleware extracts tenant context from the authenticated user and enforces RBAC. This is where “authenticated user” becomes “Bob from KCCI with access to tenants A and B but not C.” System administrators get elevated access based on Keycloak group membership, not database flags.

The ES256 Padding Story

AWS ALB signs its JWTs with ES256 — but it pads the base64url segments with trailing = characters, which violates RFC 7515. The Go golang-jwt library rejects padded tokens by default, so every request returns 401.

The fix is a single function call: jwt.WithPaddingAllowed(). But the obvious alternative — stripping the padding before validation — would break signature verification, because the ECDSA signature was computed over the original padded segments. Change the input, invalidate the signature.

Warning

Don’t strip base64url padding from ALB JWTs. The signature was computed over the padded content. Use jwt.WithPaddingAllowed() (golang-jwt v5.2.0+) to accept padding without altering the signing input.

Build-Tag Security

The Go API uses build tags to make authentication bypass impossible in production binaries:

// config_secure.go — //go:build !dev
// Auth is always enabled, always enforced.
// Environment variables are ignored.

// config_dev.go — //go:build dev
// Auth respects ALB_SIGNATURE_VALIDATION_ENABLED
// and WEBHOOK_AUTH_ENABLED environment variables.

Even if CI/CD accidentally sets ALB_SIGNATURE_VALIDATION_ENABLED=false, a production binary compiled without the dev tag will ignore the variable and enforce authentication. Mock auth is compile-time excluded, not runtime toggled.

The Auth Evolution

This architecture didn’t arrive fully formed. The original plan was Keycloak self-hosted as the sole identity provider — a full IAM server deployed alongside the application. That worked but carried operational burden: another service to patch, monitor, and back up, plus its own PostgreSQL database.

The first simplification was nginx OIDC-RP for local development. Instead of running Keycloak on every developer’s machine and maintaining keycloak-js in the frontend, OpenResty (nginx with Lua) acts as an OIDC relying party that injects the same x-amzn-oidc-* headers the ALB does. The frontend code became identical across environments — zero auth code in the Next.js build. This eliminated keycloak-js, the silent-check-sso iframe, session cookie management in Next.js, and the entire token-getter pattern. The cleanup removed more code than the nginx OIDC-RP configuration added.

The second simplification was Cloudflare Access as the perimeter layer. Rather than relying solely on ALB OIDC (which anyone could reach if they knew the ALB DNS name), Cloudflare Access provides identity verification at the edge — before the request reaches AWS at all. This is when the ALB security group whitelist became critical: it ensures the only path to the application goes through Cloudflare.

GitOps with ArgoCD

Nobody SSHes into production. Nobody runs kubectl apply. Merge to main, and the change is live. That’s the contract, and ArgoCD is the enforcer.

graph LR
    DEV["Developer"]
    GH["GitHub
    main branch"]
    CI["GitHub Actions
    Build · Test · Push"]
    ECR["AWS ECR
    Docker Registry"]
    IU["ArgoCD Image Updater
    Watches digest changes"]
    ARGO["ArgoCD
    GitOps Controller"]
    EKS["EKS Cluster
    Live Deployment"]

    DEV -->|merge PR| GH
    GH -->|triggers| CI
    CI -->|push image| ECR
    ECR -->|digest changed| IU
    IU -->|commit new digest| GH
    GH -->|manifest changed| ARGO
    ARGO -->|sync| EKS

The flow for main (automatic):

  1. A developer merges a PR to main.
  2. GitHub Actions runs tests, builds Docker images for each package (API, Web, Mock, MCP), and pushes them to ECR tagged as branch-main.
  3. ArgoCD Image Updater detects that the branch-main tag’s SHA digest has changed.
  4. Image Updater commits the updated digest to the Kustomize overlay in the repo.
  5. ArgoCD detects the manifest change and syncs it to the EKS cluster.

The flow for demo (manual): a developer creates a git tag or explicitly updates the demo overlay’s image tags. ArgoCD syncs the change. No automatic deployment — demo stability matters for stakeholder presentations.

Kustomize Overlays

Each component follows a base/overlays pattern:

infrastructure/kubernetes/
├── api/
│   ├── base/           # Shared: deployment, service, configmap
│   └── overlays/
│       ├── main/       # Namespace, replicas, ingress, env config
│       └── demo/       # Same structure, different values
├── web/
│   └── (same pattern)
├── mock/
│   └── (same pattern)
└── mcp/
    └── (same pattern)

The base defines the deployment shape (resource limits, health checks, ports). The overlays inject environment-specific configuration: namespace (osprey-main vs osprey-demo), replica count (2 vs 1), Render mode (MOCK vs SANDBOX), and ALB group names. Main runs with mock external integrations for development speed; demo runs against the real Render API in sandbox mode.

ArgoCD Application resources aren’t static YAML. They’re generated dynamically by OpenTofu — a Cartesian product of environments × applications. Add a new component or environment, and the ArgoCD apps create themselves.
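A sketch of what the main overlay's kustomization.yaml might contain — the field names are standard Kustomize, but the specific values and file names are illustrative, not taken from the repo:

```yaml
# overlays/main/kustomization.yaml — illustrative values
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: osprey-main
resources:
  - ../../base
replicas:
  - name: api
    count: 2                   # the demo overlay sets 1
patches:
  - path: ingress-patch.yaml   # per-environment ALB group.name
configMapGenerator:
  - name: api-config
    behavior: merge
    literals:
      - RENDER_MODE=MOCK       # demo uses SANDBOX
```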

ArgoCD Dashboard Access

ArgoCD itself needs authentication for its dashboard. Rather than running another Keycloak client, it uses its bundled Dex connector with Google Workspace via SAML. Developers log in with their Google account. RBAC is applied based on email and group membership. No additional identity infrastructure required.

Tip

The ArgoCD Image Updater uses a digest strategy with mutable branch tags. CI always pushes to branch-main. Image Updater detects the tag’s SHA changed, commits the new digest, and ArgoCD syncs. This avoids the proliferation of image tags while maintaining an auditable git history of every deployment.

Infrastructure as Code

If you can’t recreate the infrastructure from scratch using a single command, you don’t have infrastructure as code — you have infrastructure as folklore. Osprey Strike’s cloud resources are defined in OpenTofu modules, organized by concern.

The OpenTofu modules break down by responsibility:

| Module | Creates |
| --- | --- |
| vpc | VPC, subnets, NAT gateway, route tables |
| eks | EKS cluster, node groups, IRSA roles |
| rds | PostgreSQL instances per environment |
| acm-certificates | Wildcard SSL certificates with DNS validation |
| argocd/setup | ArgoCD installation on the cluster |
| argocd/configuration | ArgoCD applications (dynamic from environments × components) |
| eso/setup | External Secrets Operator installation |
| eso/configuration | SecretStore and ExternalSecret resources |
| keycloak/* | OIDC clients, realm config, DNS records, groups mapper |
| cf-worker | Cloudflare Worker for Twilio webhook forwarding |
| cloudflare-ips | Lambda that syncs Cloudflare IP ranges to ALB security group |
| jumpbox | EC2 bastion for secure EKS access |
| github-oidc-provider | AWS OIDC provider for GitHub Actions (keyless CI auth) |
| prefix-list | Managed prefix lists for security group rules |
| s3-attachments | S3 bucket for file attachments |

Secrets don’t live in OpenTofu state or Kubernetes manifests. AWS Secrets Manager holds sensitive values (Render API credentials, encryption keys), and the External Secrets Operator (ESO) syncs them into Kubernetes Secrets. The API pods reference Kubernetes Secrets in their environment variables — they never know the secrets came from AWS.

The jumpbox provides secure access when you need to debug something in the cluster. It’s an EC2 instance in a private subnet with an SSH tunnel to the EKS API server. No direct EKS API exposure to the internet, no broad VPN access.

The GitHub OIDC provider deserves a mention. GitHub Actions authenticates to AWS using OIDC federation — no long-lived AWS credentials stored in GitHub Secrets. The IAM role trusts GitHub’s OIDC provider and scopes access by repository and branch.

Definition

External Secrets Operator (ESO): A Kubernetes operator that reads secrets from external providers (AWS Secrets Manager, HashiCorp Vault, etc.) and creates Kubernetes Secrets. It handles rotation on a refresh interval — update the secret in AWS, and ESO updates the Kubernetes Secret; volume-mounted secrets pick up the change in place, while pods consuming secrets as environment variables need a restart.
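An illustrative ExternalSecret resource showing the shape of the bridge — the API fields are ESO's, but the names, keys, and paths here are hypothetical:

```yaml
# Illustrative ExternalSecret — names and keys are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: render-credentials
  namespace: osprey-main
spec:
  refreshInterval: 1h            # how often ESO re-reads AWS
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: render-credentials     # the Kubernetes Secret ESO creates
  data:
    - secretKey: RENDER_API_KEY
      remoteRef:
        key: osprey/render       # AWS Secrets Manager entry
        property: api_key
```

The API pods then reference `render-credentials` like any other Kubernetes Secret, never touching AWS directly.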

Horizontal Scalability

The system earns a B+ for horizontal scalability. That’s not a euphemism for “it doesn’t scale.” It means the core architecture is sound for multi-instance deployment, with specific, documented gaps that have known solutions.

What Works (A-Grade)

Stateless authentication. Every request carries its own proof of identity via JWT. No server-side sessions, no shared session store. The load balancer can use round-robin routing — any instance can handle any request.

Event sourcing with optimistic locking. Multiple API instances can safely process commands against the same aggregate. The database enforces uniqueness on (aggregate_id, version). If two instances try to write the same version concurrently, one gets ErrVersionConflict and retries. No distributed locks required.

Idempotent projections. Event handlers that build read models are safe to replay. Database-backed, deterministic, and independent. You can add instances without worrying about projection state divergence.

Transactional outbox. Events are appended to the outbox in the same database transaction as the aggregate write. A background publisher polls unpublished entries and publishes them to Watermill. If an instance crashes after commit, another instance picks up the unpublished entries. Guaranteed eventual delivery.

What Doesn’t Work Yet (The Gaps)

Render polling in-memory state. The Render integration service maintains fingerprints, task counts, and initialization state in sync.Map — process-local memory. With two instances polling the same ECO, each maintains independent state. Instance B sees “all tasks changed” because it doesn’t have Instance A’s fingerprint cache. The symptom: duplicate change events, spurious notifications.

This isn’t just a scaling problem — it’s a restart problem. Every deployment loses the fingerprint cache and triggers a burst of false change events for every active ECO.

WebSocket subscription broadcast. The subscription broker maintains in-memory maps of connected WebSocket clients. Instance A publishes a pager event; Instance B’s clients don’t receive it because they’re in a different process’s memory.

Outbox publisher duplicate processing. Multiple instances poll the same outbox table. Two instances may fetch the same unpublished events and both attempt to publish. Handlers are idempotent, so correctness isn’t affected — but it’s wasted work and confusing metrics.

The Scaling Roadmap

| Gap | Short-Term Fix | Long-Term Fix |
| --- | --- | --- |
| Render polling state | Externalize to Redis | Extract as standalone microservice |
| WebSocket broadcast | ALB sticky sessions | Redis Pub/Sub via Watermill |
| Outbox duplicates | Accept (idempotent handlers) | FOR UPDATE SKIP LOCKED |
| Deadline worker races | Add lease mechanism | Same lease mechanism |

For a two-instance redundancy deployment, sticky sessions and accepting idempotent duplicates are fine. The Redis path becomes necessary at three or more instances — about $12/month for an ElastiCache cache.t3.micro.
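The long-term outbox fix is standard Postgres work-queue claiming. An illustrative version of the publisher's query (table and column names hypothetical):

```sql
-- Each publisher instance claims a disjoint batch inside its transaction;
-- rows locked by another instance are skipped rather than waited on,
-- so two pollers never fetch the same unpublished event.
SELECT id, payload
FROM outbox
WHERE published_at IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;
```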

Key Takeaway

The B+ isn’t aspirational — it’s an honest assessment with a clear upgrade path. Stateless auth, idempotent event processing, and optimistic locking are the hard parts, and they’re already done. The remaining gaps are state externalization problems with known solutions.

Cost Model

The entire production stack runs for approximately $250/month. For a system with two environments, four application components, managed Kubernetes, managed PostgreSQL, and edge security — that’s a reasonable number.

Here’s where the money goes:

| Component | Monthly Cost | Notes |
| --- | --- | --- |
| EKS Control Plane | ~$73 | Fixed cost, doesn’t scale with traffic |
| EKS Nodes (t3.medium × 2) | ~$60 | Runs all pods across both environments |
| Application Load Balancers (× 2) | ~$40 | One per environment |
| RDS PostgreSQL (db.t4g.small × 2) | ~$60 | One per environment |
| ECR Storage | ~$5 | Docker image storage |
| Data Transfer | ~$10 | Outbound bandwidth |
| AWS Total | ~$248 | |
| Cloudflare (Free tier) | $0 | WAF, CDN, Access, Workers |
| Grand Total | ~$250 | |

The cost components scale differently. EKS control plane is fixed — you pay $73 whether you run one pod or a hundred. Node costs scale with compute needs (add nodes). ALB costs scale with traffic volume (connection hours + processed bytes). RDS scales with instance class (vertical) or read replicas (horizontal). Cloudflare’s free tier covers everything Strike currently needs.

The alternatives were considered and rejected:

  • ECS Fargate + ALB (~$160-230/month): cheaper, but the team has EKS expertise, not ECS. The learning curve cost exceeds the monthly savings.
  • App Runner + Amplify (~$115-180/month): cheapest, but you can’t enforce security group whitelists on ingress. The *.awsapprunner.com URLs are publicly accessible, which defeats the Cloudflare perimeter model.
  • EKS + self-hosted Keycloak (~$280-320/month): Keycloak adds another pod, another database, and another service to patch and monitor. Cloudflare Access does the same job at the edge for $0.

The Cloudflare free tier is remarkably generous for this use case. WAF rules, DDoS protection, CDN caching, Zero Trust Access policies, Workers — all covered. The paid tier ($20/month for Pro) would add advanced WAF rules and analytics, but isn’t necessary yet.

Tip

If budget is a concern, the single largest fixed cost is the EKS control plane at $73/month. For a lower-traffic product, ECS Fargate would cut that to zero (you pay only for task runtime). The trade-off is rewriting Kubernetes manifests as ECS task definitions and learning a different deployment model.

Summary

  1. The production stack flows from Cloudflare (edge security, CDN, Zero Trust) through AWS ALB (OIDC auth, security group whitelist) to an EKS cluster running four application components across two namespaced environments, backed by RDS PostgreSQL with PostGIS.
  2. Four authentication layers provide defense in depth: Cloudflare Access (perimeter identity), ALB OIDC (session management), Keycloak via nginx OIDC-RP (user provisioning and groups), and API middleware (tenant context and RBAC). Build-tag security ensures mock auth is compile-time excluded from production binaries.
  3. GitOps via ArgoCD means deployments are git commits, not SSH sessions. Merge to main triggers image builds; ArgoCD Image Updater detects digest changes; ArgoCD syncs to the cluster. Demo deployments are manually promoted for stability.
  4. Infrastructure is defined in OpenTofu modules covering EKS, RDS, VPC, certificates, secrets, and Cloudflare Workers. External Secrets Operator bridges AWS Secrets Manager to Kubernetes. A jumpbox provides secure cluster access without exposing the EKS API.
  5. Horizontal scalability earns a B+ — stateless auth, event sourcing with optimistic locking, and idempotent projections are production-ready. Render polling state, WebSocket broadcast, and outbox duplicate processing are documented gaps with clear upgrade paths.
  6. The entire stack runs for ~$250/month, with Cloudflare's free tier handling edge security. The EKS control plane ($73) is the largest fixed cost; most other components scale with actual usage.

Discussion Prompts

  • The four auth layers evolved incrementally — each was added to solve a specific problem. If you were designing from scratch today with Cloudflare Access available, would you still need Keycloak? What would you lose by removing it?
  • The ALB security group whitelist is described as the "linchpin" of the security model. What happens if Cloudflare has an outage? Is there a break-glass path to access the application, and should there be?
  • The Render polling service's in-memory state is both a scaling problem and a reliability problem (state lost on restart). Does this change the priority calculus — should it be fixed for single-instance reliability before worrying about multi-instance scaling?

References

  1. Cloudflare Access Documentation — Zero Trust authentication configuration and policy management.
  2. AWS ALB OIDC Authentication — How ALB handles OpenID Connect flows and header injection.
  3. ArgoCD User Management and Dex Configuration — Setting up SSO connectors for ArgoCD dashboard access.
  4. lua-resty-openidc — The OpenID Connect Relying Party library for nginx/OpenResty used in local development auth simulation.
  5. EKS Best Practices Guide — AWS reference for production EKS cluster operation.
  6. External Secrets Operator — Kubernetes operator for synchronizing secrets from external providers.
  7. Cloudflare IP Ranges — The published IP list used for ALB security group whitelisting.