Two Layers Shipped This Week. The Hardest One Didn’t.
The Data Product Report: Weekly State of the Market in Data Product Building | Week ending April 27, 2026
This Week
DuckDB just shipped a Jepsen-validated lakehouse that makes “is this a real warehouse alternative for medium-data” a live question for your next architecture decision. The pattern for wiring agents into your pipeline started to harden: embed them in CDC streams, not in chat windows. And Google Cloud Next showed up with agents in every keynote and a semantic layer that’s still mostly aspirational. Two of the three layers your stack runs on consolidated this week. The one that grounds the other two didn’t.
DuckDB Stops Being a Side Project
Two years ago, DuckDB was the thing you reached for when you needed to query a CSV without spinning up a database. This week, v1.5.2 shipped DuckLake v1.0, expanded Iceberg support, Jepsen-validated correctness, and a 10% TPC-H improvement. All in a single patch release. DuckLake is DuckDB’s own lakehouse spec. The Iceberg extension picks up geometry types, ALTER TABLE, and partitioned deletes. And the Jepsen pass (the gold standard for distributed-systems correctness) surfaced and fixed a primary-key bug, exactly the kind of finding that shifts production trust. The community is already wiring DuckDB into dbt, Rill, and AI assistants as the default analytical engine. Some teams hit out-of-memory issues on billion-row workloads and reach for ClickHouse instead, an honest boundary that tells you where DuckDB fits and where it doesn’t.
What does this look like when practitioners actually build on it? A geospatial engineer ran 3.4 million solar panel records through a DuckDB Spatial pipeline: GPKG to reprojection to WKB to Hilbert-ordered Parquet with ZSTD compression. The workflow is a reusable reference architecture for any spatial dataset. One detail signals where production trust currently sits: the author pinned DuckDB v1.4.4, citing issues with v1.5.1. People are building real pipelines and version-locking because the output matters.
Then Posit shipped ggsql, a SQL-native Grammar of Graphics with VISUALIZE, DRAW, PLACE, SCALE, and LABEL syntax. It works with Parquet, CTEs, and window functions. DuckDB is the natural execution engine. For SQL-first teams, that’s one less reason to context-switch to Python or R for visualization.
The bottom line: DuckDB isn’t the local query tool anymore. With Iceberg compatibility, Jepsen correctness, and a lakehouse spec under the same binary, it’s a credible analytical platform for medium-data workloads. Worth a real evaluation before your next architecture decision. If your team is still treating it the way you did two years ago, the gap between your perception and its capability is widening fast.
Treat Your Agent Like Your Pipeline
Storage is getting handled. The next decision is what to do when leadership keeps asking why an agent isn’t already wired into the workflow. The practitioner answer that crystallised this week: stop building agents that need babysitting, and start treating them like the data services your platform team already knows how to run.
The clearest framing came from Feldera, in a piece that read like a manifesto for the embed-don’t-chat thesis: expose CDC streams instead of snapshots, build machine-first interfaces, and stop asking the model to be a coworker who needs supervision. The argument lands because the plumbing already exists: if you’ve spent the last year writing dbt models that emit CDC events for downstream consumers, you already know how to ship data to an agent. The agent is just another consumer. The framing inversion (“embed in your software” rather than “chat with your humans”) is the one that finally gives data teams a tractable role in agent rollouts.
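What does “the agent is just another consumer” look like in practice? A minimal sketch under stated assumptions: the event shape, handler names, and the high-value-order policy below are all hypothetical illustrations, not Feldera’s API.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChangeEvent:
    """One CDC record: the same shape any downstream consumer would read."""
    op: str        # "insert" | "update" | "delete"
    table: str
    after: dict

def make_agent_consumer(act: Callable[[ChangeEvent], None]) -> Callable[[str], None]:
    """The agent subscribes like any other consumer: no chat loop, no human
    prompts -- just change events driving machine-first actions."""
    def consume(raw: str) -> None:
        payload = json.loads(raw)
        act(ChangeEvent(op=payload["op"], table=payload["table"],
                        after=payload.get("after", {})))
    return consume

# Illustrative policy: the "agent" flags high-value order inserts for review.
flagged = []
consume = make_agent_consumer(
    lambda ev: flagged.append(ev.after["order_id"])
    if ev.op == "insert" and ev.after.get("amount", 0) > 10_000 else None
)

for raw in (
    '{"op": "insert", "table": "orders", "after": {"order_id": 1, "amount": 12000}}',
    '{"op": "insert", "table": "orders", "after": {"order_id": 2, "amount": 90}}',
):
    consume(raw)
# flagged == [1]
```

Swap the in-memory loop for your actual CDC transport (Kafka, a Feldera pipeline, a webhook) and the agent slots into the same consumer contract your reverse-ETL jobs already use.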
The discipline gap shows up in vendor risk too. Kimi’s Vendor Verifier runs targeted benchmarks against inference providers to catch silent misconfigs and quant swaps: your model provider quietly shipping a quantised version that scores lower on your eval set without telling you. The pattern is the one your data team already practises on Snowflake and BigQuery: treat the upstream as untrusted, instrument it, alert on drift. Vendor Verifier is dbt-tests-for-your-LLM-provider, and it should be in your evaluation suite by the end of the quarter.
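The drift check itself is small. Here is a hedged sketch of the idea, not Vendor Verifier’s actual implementation; the probe scores, baseline, and tolerance are invented for illustration.

```python
import statistics

def detect_provider_drift(scores: list[float], baseline_mean: float,
                          tolerance: float = 0.02) -> bool:
    """Alert when the provider's mean score on a fixed probe set drops below
    the pinned baseline by more than `tolerance` -- the quant-swap signature."""
    return statistics.mean(scores) < baseline_mean - tolerance

baseline = 0.91                              # pinned when you qualified the provider
todays_scores = [0.84, 0.88, 0.86, 0.85]     # silent quantisation shows up here
drifted = detect_provider_drift(todays_scores, baseline)
# drifted == True
```

In production you would run the probe set on a schedule against the live endpoint, exactly as a dbt test runs against the live warehouse, and page on the first failing window.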
The credential layer matters too. If you don’t already trust your agent stack with your warehouse credentials, you shouldn’t. Kloak uses eBPF to swap hashed placeholders for real secrets at the kernel level, so the agent never sees the credential it’s using. For data teams running agents against production Snowflake or BigQuery in regulated environments, this is the pattern to watch: credential isolation enforced for any process that touches your warehouse, not just AI.
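Kloak does the swap in the kernel via eBPF, which a newsletter snippet can’t reproduce. As a user-space analogue of the idea, here is a sketch in which the agent only ever handles a placeholder and a trusted boundary substitutes the real secret at request time; all names and the `kloak://` scheme are invented for illustration.

```python
# Held by the trusted side only; the agent process never reads this.
SECRETS = {"SNOWFLAKE_PASSWORD": "hunter2-real"}

def issue_placeholder(name: str) -> str:
    """What the agent is given: a stable token, never the secret itself."""
    return f"kloak://{name}"

def resolve_at_boundary(request: dict) -> dict:
    """Runs in the trusted proxy, not the agent process: swap placeholders
    for real values just before the request leaves for the warehouse."""
    resolved = {}
    for key, value in request.items():
        if isinstance(value, str) and value.startswith("kloak://"):
            resolved[key] = SECRETS[value.removeprefix("kloak://")]
        else:
            resolved[key] = value
    return resolved

agent_request = {"user": "etl_agent",
                 "password": issue_placeholder("SNOWFLAKE_PASSWORD")}
wire_request = resolve_at_boundary(agent_request)
# agent_request carries only the placeholder; wire_request carries the secret
```

The kernel-level version gives you the same property without trusting the proxy’s process boundary, which is why it is the stronger pattern for regulated environments.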
The bottom line: Agents aren’t a new category of system. They’re a new consumer of the same pipelines, with the same trust requirements. Build CDC interfaces for them, run drift detection on the model provider, and isolate their credentials the way you isolate your reverse-ETL service account. The teams that win this transition will be the ones who treat the agent layer as data infrastructure, not as a chat product they’re afraid of.
The Agent Stack Is Ready. The Semantic Engine Isn’t.
Storage is consolidating. The agent integration pattern is hardening. Which leaves the layer that grounds the other two, and that’s the layer Google Cloud Next ‘26 made conspicuous by failing to ship.
Olivier Dupuis’ read of Cloud Next cuts to the gap. Agents were everywhere: in apps, in Looker dashboards, in development workflows, in a marathon-planning demo on the keynote stage. Gemini Enterprise got positioned as “the connective tissue between your data, your people, and all of your apps and agents.” But the semantic engine landed differently. The thing that gives those agents a unified business context to ground their answers in shipped as a Knowledge Catalog rebrand of Dataplex, buried in a side announcement. Looker’s LookML still holds the company’s actual semantic modeling, and it wasn’t meaningfully connected to the new agent layer. The Knowledge Catalog isn’t the business context layer. It’s a metadata aggregator with aspirations.
The point isn’t that Google missed. It’s that somebody’s going to own this. Palantir is already branding Foundry as “the ontology-powered operating system for the modern enterprise.” OpenAI is positioning Frontier the same way: model intelligence tied to platform stickiness, switching costs that rise as embedding deepens. As the analysis puts it, “we shouldn’t just slap LLMs and agents on top of our old data stack and expect miracles.” And Joe Reis’s 1905 electrification analogy keeps showing up here for a reason. Factories didn’t gain anything from swapping steam engines for electric motors until they tore the building down and rebuilt it around the new power source. An agent grounded on a fragmented semantic layer is a steam engine with an electric label.
The question isn’t “which vendor’s ontology platform should we wait for.” It’s: where does your business meaning live today? In dbt’s Semantic Layer? In LookML? In a metrics layer in Cube? In MetricFlow? In nobody’s head and three SQL stored procs? Whichever it is, that’s the surface area an agent will eventually call against. The teams ahead are the ones consolidating it now, not because Google said to, but because the work has to happen before the agent layer matters.
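Auditing that surface area can start very small: export every metric definition you can find and look for names defined differently in two places. A toy sketch, with all layer names and expressions invented for illustration:

```python
from collections import defaultdict

# Hypothetical export: (layer, metric_name, sql_expression) tuples pulled
# from dbt Semantic Layer, LookML, Cube, etc. The shapes are illustrative.
definitions = [
    ("dbt",    "revenue", "sum(amount)"),
    ("lookml", "revenue", "sum(amount_usd)"),   # same name, different meaning
    ("cube",   "active_users", "count(distinct user_id)"),
]

# Group expressions by metric name across every layer.
expressions_by_metric = defaultdict(set)
for layer, name, expr in definitions:
    expressions_by_metric[name].add(expr)

# Metrics defined differently in two layers are the consolidation backlog.
backlog = sorted(name for name, exprs in expressions_by_metric.items()
                 if len(exprs) > 1)
# backlog == ["revenue"]
```

The backlog list is the migration plan: each entry is a metric whose single source of truth has to be chosen before an agent can safely call against it.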
The bottom line: Two layers of your stack consolidated this week. The meaning layer is where your leverage actually lives, and no vendor has shipped it. Audit your semantic surface area. Pick the layer you’ll standardise on (dbt Semantic Layer, LookML, Cube, MetricFlow, ontology) and start migrating toward it. The companies that figure out ontology first won’t wait for Google to catch up. The data engineers leading that work won’t either.
The Radar
Quick hits on stories worth knowing about, organized by what you’re building.
If you’re self-hosting models: DeepSeek V4 shipped two MoE variants with million-token context. Hybrid attention cuts compute requirements to roughly a quarter and KV cache to a tenth of previous generations. SGLang and Miles shipped day-zero serving support with prefix caching and optimized MoE kernels, so you don’t need to build your own. If your team is on older DeepSeek models, the API migration deadline is July 24.
If you’re deploying agents: WUPHF builds a multi-agent “office” around a shared Markdown+Git wiki with typed triples and contradiction detection. Worth borrowing if your agents can’t remember what they learned yesterday.
If you care about inference quality: The flinch research shows that even “uncensored” models carry hidden probability reductions for charged terms, and those reductions survive fine-tuning. Your evaluation suite has blind spots.
If you’re evaluating dev tools: Zed’s Parallel Agents adds a Threads Sidebar for orchestrating multi-agent workflows with per-thread repo access. Meanwhile, Kuri reimagines browser automation for agents in Zig: tiny binaries, sub-5ms cold starts, ~16% fewer tokens per workflow.
If you care about governance: Meta’s Model Capability Initiative captures employee keystrokes, mouse movements, and screen content to train AI agents. Mandatory for some teams, opt-out for others. Whatever you think of the ethics, the data collection pattern is coming to your org next.
The Data Product Report is published every Tuesday by RepublicOfData.io.
Where does your business meaning live today: dbt Semantic Layer, LookML, Cube, MetricFlow, or somewhere else? Reply and tell us what your team is consolidating around. The best responses go in next week’s edition.


