The Data Problem Nobody Talks About

Outside plant fiber construction generates an extraordinary amount of data that nobody treats as data. Crew dispatch logs in spreadsheets. Equipment inventories in someone’s head. Project timelines in email threads. Billing records in one system, material usage in another, and the actual state of the physical network in a third — if you’re lucky enough to have a third.

This isn’t a technology problem. It’s an archaeology problem. The information exists, scattered across PDFs, CAD files, Excel workbooks, handwritten field notes, and the institutional memory of people who’ve been doing this for twenty years. What doesn’t exist is a single place where all of it lives in a form that lets you ask questions across the whole picture. The OSP industry’s data fragmentation isn’t unique — it’s a pattern across field-service industries where digital tooling arrived unevenly. What makes fiber construction distinctive is the combination of geospatial complexity, regulatory overhead, and real-time operational pressure.

That’s the problem Osprey Vantage is designed to solve. Not by replacing every system — that’s a fool’s errand — but by giving every system a place to pour its data, and then letting you ask questions that span the whole mess.

To do that, you need an architecture. Specifically, you need a lakehouse.

Two Buildings That Don’t Quite Work

Before we get to the lakehouse, it helps to understand the two things it replaced — and why neither was sufficient on its own.

The Data Warehouse is the well-organized library. Everything has a catalog entry. Every book sits on the right shelf. If you want to find something, you know exactly where to look. The tradeoff: every book must be processed, cataloged, and shelved before it enters the building. That processing — the transformation step — is expensive, slow, and rigid. Change the catalog system and you’re re-shelving everything.

Data warehouses excel at structured, predictable queries. They struggle with data that doesn’t fit the predetermined schema, with unstructured content, and with the kind of exploratory analysis where you don’t know what question you’re asking until you see the data.

The Data Lake is the opposite philosophy. Throw everything in. Raw files, structured records, sensor readings, images — the lake accepts all of it with no questions asked. Cheap storage, infinite flexibility, zero opinions about structure.

The problem: lakes without governance become swamps. When everything goes in and nothing gets organized, the “just query it” promise collapses into “nobody can find anything, and what they find, they can’t trust.” The “data swamp” phenomenon is well-documented. A 2023 Gartner study found that organizations with data lakes but no governance layer reported that less than 30% of stored data was actively used for analysis. The rest was write-only storage — expensive, untrustworthy, and growing.

Key Takeaway

A warehouse is too rigid. A lake is too loose. The lakehouse is the architectural recognition that you need both — raw preservation and progressive refinement — in a single system.

The Lakehouse: Structured Freedom

The lakehouse combines the warehouse’s reliability guarantees — ACID transactions, schema enforcement, time travel — with the lake’s storage economics and format flexibility. Data lives in open file formats on cheap object storage (like S3), but a metadata layer provides the structure and guarantees you’d expect from a traditional database.

Think of it as a lake with a librarian. The water is still there — raw, deep, inclusive. But now someone is maintaining a card catalog, enforcing checkout rules, and ensuring that what you find on the shelf matches what the catalog says is there.

The key enabling technologies:

  • Open table formats like Delta Lake and Apache Iceberg provide ACID transactions over files stored in object storage
  • Decoupled compute and storage means you pay for storage at lake prices and only spin up compute when you need it
  • Schema-on-read means raw data enters unmodified; structure is applied when it’s consumed, not when it arrives

This matters for Vantage because OSP construction data comes from everywhere, in every format, on every schedule. APIs return JSON. Field reports arrive as PDFs. Equipment sensors emit telemetry. A rigid schema-on-write system would reject half of it. A schema-on-read lakehouse accepts all of it and applies structure progressively.

The Medallion Pattern

Raw data isn’t useful data. Useful data isn’t ready for dashboards. This is the fundamental insight behind the medallion architecture — data doesn’t go from “raw” to “ready” in one step. It goes through stages, and each stage has a different purpose, a different quality bar, and a different audience.

graph LR
    Sources["External APIs<br/>Field Data<br/>Sensors"] --> Bronze
    Bronze["Bronze<br/>Raw, Immutable"] --> Silver["Silver<br/>Cleansed, Conformed"]
    Silver --> Gold["Gold<br/>Business-Ready"]
    Gold --> Consumers["Dashboards<br/>Reports<br/>ML Models"]

Three layers. Three quality tiers. One direction of flow.

Bronze: The Audit Trail

Bronze is the raw layer. Data arrives exactly as the source system produced it — no transformation, no filtering, no opinions. Every record is preserved. Every API response is archived.

Why keep the raw data? Three reasons:

  1. Reproducibility. If a transformation in Silver is wrong, you can reprocess from Bronze without going back to the source system.
  2. Audit trail. Regulated industries need to prove what data they received, when they received it, and what they did with it.
  3. Schema evolution. When a source API adds a field or changes a format, Bronze captures the change. Downstream layers adapt on their own schedule.

Definition

Bronze layer principle: Accept everything, modify nothing. The source system is the authority. Bronze is the faithful record.

Silver: The Clean Room

Silver is where data gets trustworthy. Deduplication, validation, standardization, enrichment — all the work that turns “data from an API” into “data a human can reason about.”

In a construction context, Silver is where:

  • Duplicate crew dispatch records get merged
  • Equipment IDs get standardized across vendors
  • GPS coordinates get validated against known service areas
  • Missing fields get flagged rather than silently propagated

Silver isn’t business logic. It’s data hygiene — the work that should happen regardless of what you plan to do with the data.
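Here’s what a couple of these hygiene rules might look like as a Go sketch. The Dispatch type, the last-write-wins dedup rule, and the bounding-box check are all invented for illustration; real validation would use GIS polygons, not a rectangle.

```go
package main

import "fmt"

// Dispatch is a simplified crew dispatch record.
type Dispatch struct {
	ID   string
	Crew string
	Lat  float64
	Lon  float64
}

// dedupe keeps the last record seen for each ID, preserving first-seen
// order — a common Silver rule when later records supersede earlier ones.
func dedupe(records []Dispatch) []Dispatch {
	seen := map[string]Dispatch{}
	order := []string{}
	for _, r := range records {
		if _, ok := seen[r.ID]; !ok {
			order = append(order, r.ID)
		}
		seen[r.ID] = r
	}
	out := make([]Dispatch, 0, len(order))
	for _, id := range order {
		out = append(out, seen[id])
	}
	return out
}

// inServiceArea validates coordinates against a hypothetical bounding
// box; records that fail are flagged downstream, not silently dropped.
func inServiceArea(r Dispatch) bool {
	return r.Lat > 36.0 && r.Lat < 49.0 && r.Lon > -104.0 && r.Lon < -80.0
}

func main() {
	raw := []Dispatch{
		{ID: "D-1", Crew: "C-17", Lat: 41.8, Lon: -87.6},
		{ID: "D-1", Crew: "C-17", Lat: 41.9, Lon: -87.7}, // duplicate dispatch, later correction
		{ID: "D-2", Crew: "C-09", Lat: 0, Lon: 0},        // bad GPS — flag, don't propagate
	}
	for _, r := range dedupe(raw) {
		fmt.Println(r.ID, inServiceArea(r))
	}
}
```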

Gold: The Answer Layer

Gold is business-ready. Aggregated, modeled, optimized for specific questions. Star schemas for BI tools. Feature tables for machine learning. Pre-computed metrics for dashboards.

Gold is where “how many crews were dispatched last month” becomes a query that returns in milliseconds instead of a query that scans terabytes. It’s the layer where data serves a purpose instead of just existing.

The important constraint: Gold is derived from Silver. Always. If the underlying data changes, Gold rebuilds from the cleansed layer — never from raw. This guarantees that every business metric traces back to validated, deduplicated source data.
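A toy Go sketch of the idea: a Gold-style pre-aggregation computed once from Silver rows, after which the dashboard question is a map lookup instead of a scan. The types and field names are invented.

```go
package main

import "fmt"

// Dispatch is a Silver-level row: already deduplicated and validated.
type Dispatch struct {
	Crew  string
	Month string // e.g. "2026-02"
}

// crewsDispatchedByMonth counts distinct crews per month — a Gold
// table computed from Silver and served to dashboards as-is.
func crewsDispatchedByMonth(rows []Dispatch) map[string]int {
	crews := map[string]map[string]bool{}
	for _, r := range rows {
		if crews[r.Month] == nil {
			crews[r.Month] = map[string]bool{}
		}
		crews[r.Month][r.Crew] = true
	}
	counts := map[string]int{}
	for month, set := range crews {
		counts[month] = len(set)
	}
	return counts
}

func main() {
	silver := []Dispatch{
		{Crew: "C-17", Month: "2026-02"},
		{Crew: "C-09", Month: "2026-02"},
		{Crew: "C-17", Month: "2026-03"},
	}
	gold := crewsDispatchedByMonth(silver)
	fmt.Println(gold["2026-02"]) // the dashboard query is now a lookup
}
```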

ELT: Transform Where the Power Is

If you’ve worked with data pipelines before, you’ve probably encountered ETL — Extract, Transform, Load. It’s the traditional pattern: pull data from a source, transform it in a staging area, then load it into the destination.

Vantage uses ELT instead. The difference is the order of the last two steps, and it matters more than it sounds.

graph LR
    subgraph ETL
        direction LR
        E1["Extract"] --> T1["Transform<br/>(staging server)"] --> L1["Load"]
    end
    subgraph ELT
        direction LR
        E2["Extract"] --> L2["Load<br/>(raw to lake)"] --> T2["Transform<br/>(in the lakehouse)"]
    end

ETL transforms data before it reaches the destination. This means you need a dedicated transformation server, you need to know your schema up front, and if you get the transformation wrong, you’ve lost the raw data.

ELT loads raw data first, then transforms it inside the destination system. The raw data is preserved (that’s your Bronze layer). Transformations run against the full power of the lakehouse compute engine rather than a separate staging server. And if you need to change the transformation logic? Re-run it. The raw data is still there.

Tip

ELT isn’t just a reordering — it’s a philosophy. “Keep everything, refine progressively” aligns naturally with the medallion pattern. Bronze is the “L” in ELT. Silver and Gold are successive “T” stages.

For a small team building a data platform, ELT has a practical advantage: you don’t need to build and maintain a transformation server. The cloud data platform is the transformation engine. One less thing to operate.

What Vantage Is Building — Today

Architecture diagrams are aspirational. Code is honest. Here’s what Osprey Vantage looks like in practice, as of this writing.

The first component in production is a Go-based scraper Lambda — a serverless function that fetches data from external APIs and writes it to S3. It follows the adapter pattern: one binary, multiple deployments, each configured for a different data source.

graph TD
    EB["EventBridge<br/>(scheduled)"] --> Lambda["Scraper Lambda<br/>(Go)"]
    Lambda --> SM["Secrets Manager"]
    Lambda --> S3["S3 Landing Zone<br/>(raw JSONL)"]
    S3 --> SQS["SQS Queue"]
    SQS --> Future["Bronze+ Ingest<br/>(future)"]

The scraper is deliberately minimal. It doesn’t validate data against schemas — that’s the Bronze+ Ingest Lambda’s job. It doesn’t deduplicate — that’s Silver. It doesn’t aggregate — that’s Gold. It does one thing: fetch a complete snapshot from a source API and write it to S3 as JSONL with Hive-style partitioning. Hive-style partitioning organizes files in directories like source=render/date=2026-03-26/. This convention lets query engines (Athena, Spark, Trino) automatically partition scans, dramatically reducing the data read for time-bounded or source-filtered queries.

This is the “E” and the “L” in ELT. Data enters the system. It lands in the Landing Zone. It triggers a message on SQS. Everything downstream consumes from there.

The Adapter Pattern

The scraper’s adapter interface is worth examining because it reveals a design philosophy:

type Adapter interface {
    Name() string
    Fetch(ctx context.Context, creds Credentials) ([]json.RawMessage, error)
}

json.RawMessage — not a typed struct, not a parsed model. Raw bytes. The scraper treats source data as opaque. It knows how to authenticate and paginate, but it has no opinion about what the data contains.

This is the correct boundary. Schema knowledge at the scraper level creates coupling — when a source API adds a field, the scraper breaks. By deferring schema validation to the Bronze+ layer, the scraper stays resilient to upstream changes.

Example: Adding a New Data Source

@Dev We need to pull data from Paylocity for workforce analytics.
@Architect Write a Paylocity adapter — implement Fetch() with their auth flow and pagination. Register it in the adapter map. Add an OpenTofu deployment block with the schedule and secret name. Same binary, new config.

Infrastructure as Code

The entire infrastructure lives in OpenTofu (the open-source Terraform fork). S3 buckets, SQS queues, Lambda functions, IAM roles, EventBridge schedules — all version-controlled, all reproducible. Local development uses LocalStack to simulate AWS services without incurring cloud costs.

Tip

OpenTofu + LocalStack is an underappreciated combination for small teams. You get infrastructure-as-code discipline without the “every test costs money” problem. The integration test suite runs a complete pipeline — Secrets Manager seeded, Lambda invoked, S3 file written, SQS notification verified — on a developer’s laptop.

What Vantage Is Building — Tomorrow

The honest answer about the layers beyond the scraper is: they’re designed but not yet implemented. That’s not a gap — it’s a sequence. Building data platforms in the wrong order is how you get expensive, untested abstractions.

Here’s the planned progression:

Bronze+ Ingest Lambda (Go) — Consumes SQS messages triggered by new Landing Zone files. Validates records against typed Go structs. Writes valid records as Parquet files to the Bronze bucket. Bad records go to a dead-letter queue for inspection.

Why Go for Bronze? Two reasons: the validation logic benefits from Go’s type system, and Lambda cold-start time matters for event-driven processing. Go Lambdas start in milliseconds.

Silver Promotion Worker (Python + delta-rs) — Reads from Bronze Parquet files. Applies deduplication, merging, and standardization. Writes Delta Lake tables to S3. This is where data becomes queryable with ACID guarantees.

Why Python for Silver? The data engineering ecosystem lives in Python. Libraries like delta-rs, pandas, and PyArrow are the standard tools for this work. The resulting Go/Python split — Go for the high-throughput ingestion boundary, Python for the transformation logic — is a recurring pattern in modern data platforms: Go handles the I/O-heavy, latency-sensitive edges, Python handles the transformation and analytical middle, and each language plays to its strengths.

Gold Layer — The least defined and intentionally so. Gold depends on knowing what questions the business wants to answer. Those questions are still crystallizing. Building Gold before the questions are clear produces pretty dashboards that answer the wrong things.

Bob’s observation in the channel was exactly right: “Things get fuzzy in the gold-semantic-to-UX levels.” That fuzziness isn’t a failure of planning. It’s appropriate humility about a layer that should be demand-driven, not supply-driven.

Why This Matters for OSP

A lakehouse is a generic architecture. What makes it interesting for outside plant construction specifically?

Multiple data source types. OSP operations touch APIs (project management tools, billing systems), geospatial data (GIS records, GPS coordinates), time-series data (equipment sensors, network monitoring), and unstructured content (field reports, photos, permit documents). A warehouse alone can’t handle this range. A lakehouse can.

Regulatory and audit requirements. Construction companies operating on public rights-of-way face documentation requirements from municipalities, states, and federal agencies. The Bronze layer’s “keep everything, modify nothing” principle creates a natural audit trail. Every data point can be traced to its source, timestamped, and preserved in its original form.

Cross-system questions. The real value isn’t in any single data source. It’s in the joins. “Which crews were dispatched to emergency callouts involving equipment from supplier X in the past quarter, and what was the mean time to resolution?” That question touches dispatch data, equipment inventory, supplier records, and incident management — four systems that currently don’t talk to each other.

Key Takeaway

The lakehouse isn’t valuable because it stores data. It’s valuable because it makes questions across data sources possible. The joins are where the insight lives.

Operational intelligence at scale. As Vantage matures, the Gold layer becomes the foundation for dashboards, automated alerting, and eventually predictive analytics. Material usage patterns that predict supply shortages. Crew dispatch patterns that optimize routing. Equipment failure patterns that trigger preventive maintenance before a backhoe hits a fiber line.

None of that is possible when the data lives in spreadsheets.

The Semantic Layer Question

There’s a layer between Gold and the user that the industry calls the “semantic layer” — and it’s the piece of Vantage that’s most deliberately left open.

The semantic layer maps technical column names to business concepts. It’s where svc_area_id becomes “Service Area” and mttr_min becomes “Mean Time to Repair (minutes).” It’s also where access control lives — who can see which metrics, which dimensions are available to which roles.

For Vantage, the semantic layer is where natural language querying becomes possible. “Show me all emergency callouts in the Midwest region last month” translating to an actual SQL query against Gold tables — that’s the semantic layer at work. Natural language to SQL is one of the more practical LLM applications in the data space. The semantic layer provides the mapping that makes it reliable — without it, an LLM is guessing at column names and table relationships.
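As a sketch of the core idea, a semantic layer can start as little more than a mapping from technical columns to business labels plus role checks. Everything here — the Metric type, the role names — is invented for illustration:

```go
package main

import "fmt"

// Metric maps a technical column to its business meaning and the
// roles allowed to see it.
type Metric struct {
	Column string
	Label  string
	Roles  map[string]bool
}

var semantic = map[string]Metric{
	"mttr_min": {
		Column: "mttr_min",
		Label:  "Mean Time to Repair (minutes)",
		Roles:  map[string]bool{"ops": true, "exec": true},
	},
	"svc_area_id": {
		Column: "svc_area_id",
		Label:  "Service Area",
		Roles:  map[string]bool{"ops": true},
	},
}

// resolve answers two questions at once: what does this column mean,
// and may this role see it?
func resolve(column, role string) (label string, allowed bool) {
	m, ok := semantic[column]
	if !ok {
		return "", false
	}
	return m.Label, m.Roles[role]
}

func main() {
	label, allowed := resolve("mttr_min", "ops")
	fmt.Println(label, allowed)
}
```

This is also the mapping a natural-language query layer would consult before generating SQL against Gold tables.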

This layer is intentionally unbuilt. Not because it’s unimportant, but because building it before the Gold layer exists means building a translation dictionary for a language that hasn’t been written yet. The sequence matters.

Summary

  1. A lakehouse combines the best of both worlds — the flexibility and cost of a data lake with the reliability and query performance of a data warehouse. Open table formats like Delta Lake enable ACID transactions over files in object storage.
  2. The medallion pattern (Bronze → Silver → Gold) is progressive refinement — raw data enters unchanged, gets cleansed and standardized, then gets shaped for business consumption. Each layer has a distinct purpose and quality bar.
  3. ELT loads first, transforms later — preserving raw data and leveraging the lakehouse's compute for transformations rather than maintaining separate staging infrastructure. This aligns naturally with the medallion layers.
  4. Vantage's current state is the scraper — and that's the right starting point — a Go Lambda that fetches data from external APIs and lands it in S3. Schema validation, deduplication, and aggregation are deliberately deferred to subsequent layers that don't exist yet.
  5. The Gold and semantic layers are intentionally undefined — they depend on business questions that are still crystallizing. Building answer infrastructure before the questions are clear produces expensive decorations, not useful tools.

Discussion Prompts

  • What are the first cross-system questions you'd want to ask if all of our operational data lived in one queryable place? Identifying these shapes what Gold looks like.
  • Which existing data sources beyond the current API integrations should be prioritized for ingestion? Field reports, GIS data, equipment telemetry — what would unlock the most value soonest?
  • As the semantic layer takes shape, who should be able to query it directly — and what does "directly" mean? SQL? Natural language? Pre-built dashboards? The answer determines how much of the semantic layer we build versus buy.

References

  1. What is Medallion Architecture? — Databricks' canonical explanation of the Bronze/Silver/Gold pattern, including schema evolution and ACID guarantees over Delta Lake tables.
  2. Medallion Architecture — Data Engineering Wiki — Community-maintained reference with practical guidance on when to add sub-layers and when the three-layer pattern is sufficient.
  3. The 2025-2026 Guide to the Data Lakehouse — Comprehensive overview of the lakehouse ecosystem, including Delta Lake, Apache Iceberg, and the convergence of warehouse and lake capabilities.
  4. Delta Lake — The open-source storage layer that brings ACID transactions, time travel, and schema enforcement to data stored in Parquet on object storage. The planned format for Vantage's Silver layer.
  5. ETL vs. ELT: Key Differences Explained — Practical comparison of ETL and ELT patterns, with attention to cloud-native architectures where ELT's "transform in place" model reduces infrastructure complexity.
  6. OSP Management Challenges in Telecom — Industry perspective on data fragmentation, real-time visibility gaps, and the tooling challenges specific to outside plant fiber construction.
  7. OpenTofu — The open-source infrastructure-as-code tool (Terraform fork) used by Vantage for reproducible AWS deployments across local, dev, staging, and production environments.