The Data Report

Your Next Data Model Might Not Have a Human Author

Olivier — Tue, 02 Jun 2026 11:07:52 GMT

This Week

Most data engineers still build the model layer by hand. It is the part of the job that feels furthest from automation, because it is where the judgment lives: which grain, which joins, what counts as a customer. So the question worth sitting with this week is what happens to that judgment when an agent drafts the model for you, and whether you would ship what it hands back. The reason to treat that as a now-question and not a someday-question is dltHub Transformation, which hit public preview this week with a real rebuild behind it instead of a demo: an agent scaffolded a company’s model layer and closed a data freshness problem that had outlived months of backlog grooming. What makes it a signal worth acting on is that all of it is open to inspection, the toolkit, the numbers, the engines that can read the result, and the way to check the agent’s work.

Subscribe now

An Agent Rebuilt the Model Layer, and the Dashboard Caught Up

The thing that decides whether agent-drafted modeling earns a place in your stack is mundane: does it save real work, or just move the work around? At Navit, the answer was a number that had been stuck for months. On-time delivery for a row of dashboards sat around 80 percent against a target near 99, and the fix kept losing to the rest of the backlog. An agent-scaffolded rebuild with dltHub Transformation, which hit public preview this week, closed the gap in about three weeks, and time-to-metric dropped from days to hours.

The toolkit is an agent-guided workflow with four steps. You point it at your raw sources and it proposes a taxonomy of what is in them. It turns that into an ontology, then into a common data model (a single, tidy description of your core tables), then into the actual Python transformation code that the dlt engine runs. The agent doing the scaffolding is your choice of Claude, Codex, or Cursor. The engine executes the result, so the agent is writing the plan, not babysitting the pipeline.

What makes this worth your attention is not that an agent wrote some code. It is that the claim is testable against your own work. dltHub used the same toolkit to migrate its own stack from HubSpot to Attio in two weeks. The reported time-to-metric improvement is the kind of number you can hold up against your current migration backlog and check.

The honest caveat: this is public preview, on the paid tier, and one company’s rebuild is one data point, not a benchmark. But the shape of the claim is the news. Agent-authored modeling has moved from “watch this demo” to “here is a before-and-after you can try to reproduce.”

The bottom line: The teams that pointed an agent at a stuck migration this week got a common data model and runnable transforms out the other side, and at least one of them watched a months-old freshness problem close. What changed is not that an agent wrote code. It is that the result came with numbers a skeptic can try to break.

The Catalog Became the Place Everything Reads From

Say the agent did hand you a clean model. The next question is mundane and a little political: which of the engines your team actually runs can read it, and at what cost in copies and per-tool permissions? For most teams the honest answer is still “some of them, once we duplicate the data and wire up access for each one.” This week Databricks made Unity Catalog’s Iceberg support generally available, and the news underneath the announcement is that the catalog, not the storage format, is becoming where that question gets settled.

In plain terms: a catalog is the index that tells every tool where your tables live and who is allowed to touch them. Unity Catalog now exposes a standard catalog API that Spark, Trino, Flink, Snowflake, DuckDB, and even pandas can call to read and write the same Iceberg tables, without each of them needing broad permissions on the underlying storage. It can also federate catalogs you do not own, governing and sharing tables that live in AWS Glue or Snowflake Horizon without moving the data into Databricks first.

For anyone running a lakehouse, this is a concrete decision, not a vibe. The question on the table is whether your current Glue or Hive metastore setup still earns its place once one catalog can broker reads and writes across every engine you run. There is a roadmap promise too, to converge the Iceberg and Delta metadata so the format war quietly ends, but that part is not shipped, so treat it as a direction and not a feature.

The bottom line: The teams evaluating catalog strategy this week have a generally-available way to let any engine read one set of tables through one governed API. The ones still wiring per-engine permissions onto raw storage are maintaining plumbing the catalog layer now offers to absorb.

Before You Trust the Agent, You Have to Measure It

Here is the question that should nag at you the moment you let an agent near the warehouse: how would you even know it got the join right? The first two stories assume the output is correct. Nothing in them tells you whether the model the agent chose, the join it assumed, or the filter it applied actually holds. Hex wrote up the lab it built for exactly that question, and the part worth stealing is the pattern, not the product.

The lab, which Hex calls Shoebox, runs pairwise experiments: a candidate version against a baseline, changing one thing at a time so you can see what a new model, prompt, or piece of context actually did. A local development stack iterates fast while a shared remote workspace holds a baseline that updates daily, so every comparison is apples to apples instead of drifting underneath you. The test data is synthetic but modeled on a realistic business, not a generic public benchmark that looks nothing like your schema.

You do not need Hex to copy this. The reusable idea is three moves: isolate one variable at a time, keep a stable reference you trust, and measure against realistic data before you ship. That is a pattern any team shipping a text-to-SQL or natural-language query feature can stand up with the tools they already have.

The bottom line: The teams that built a measurement loop before turning an agent loose on their warehouse this week can answer “did this change help?” with a number. Without that loop, “the agent seems better now” is the whole quality story.

The Radar

If you’re shipping AI query interfaces:

Snowflake published Arctic Text2SQL R2, a compact model trained specifically on Snowflake’s SQL dialect that beats much larger frontier models on enterprise SQL. If you ship a text-to-SQL feature, the cost math just changed: a smaller model tuned to your warehouse’s dialect may beat the biggest general model you are paying for. And AWS published a primary walkthrough of evaluating text-to-SQL agents with LangSmith, built on a tasks, trials, and transcripts structure with offline pytest checks and online production monitoring. It pairs well with the Hex pattern above as a concrete starting point.

If you’re building pipelines:

dlt published its first throughput benchmark: roughly 65 GB/hr from Postgres to BigQuery on a small 2-core, 4 GB worker at about a dollar an hour of compute, scaling linearly. If your current extract-load job moves similar volumes, you now have a baseline to measure against. And Dagster showed how to wrap Snowflake Dynamic Tables as declarative pipeline nodes without running any Dagster-side compute, which is a clean fit if you want orchestration to declare intent and let the warehouse do the work.

If you’re rethinking your stack:

A walkthrough on building durable workflows directly on Postgres argues you can use row locking and transactional checkpoints instead of adopting Temporal or stretching Airflow. The discussion was lively and not one-sided. Teams with lighter orchestration needs may find the trade-off genuinely different from the pitch they have been hearing.

If you care about data residency:

A sharp analysis of EU sovereignty makes the case that running in an EU region is not the same as being sovereign: storage, identity, and DNS calls can still route through us-east-1 even when your workloads sit in Frankfurt. If you carry GDPR or Schrems II obligations, that control-plane exposure is an architecture question, not a checkbox.

If you’re running Databricks:

Two Lakebase updates worth a look: Always-On pricing that saves about 25 percent on steady-state workloads (with a deeper promotional discount running into January 2027), and a Change Data Feed that captures row changes natively so the operational database can feed your Bronze layer without bespoke connectors.

Have you let an agent scaffold a data model you actually shipped? Reply and tell us what you kept and what you threw away.

The Data Product Report is published every Tuesday by RepublicOfData.io.

The knowledge gap underneath the tooling

Olivier — Tue, 19 May 2026 11:15:59 GMT

This Week

A semantic layer is supposed to make your agent smarter. This week, a benchmark from semantic-layer vendor Cube put a number on how much smarter: 68%. That is the accuracy ceiling when an LLM has metric definitions but not the business reasoning underneath them. Good enough to demo, not good enough to trust. A survey of 334 data practitioners published the same week revealed why the reasoning stays undocumented: at 42% of organizations, data models belong to whoever built the pipeline last.

Your Agent Keeps Getting the Answer Wrong

Cube’s semantic layer has spent years convincing analytics teams that a centralized metrics store would save them from dashboard chaos. It worked, mostly. This week, Cube published a strategic pivot that reveals the next customer for all that centralized logic: not the analyst refreshing Looker at 9am, but the AI agent trying to answer “what was Q2 revenue in EMEA” without hallucinating the number. Cube Core, the open-source modeling layer, stays as it is. The commercial product is repositioning around agent consumption. Metrics, dimensions, access controls, business definitions: all served via APIs to guide natural-language-to-SQL and enforce governance.

The bet makes intuitive sense. If a semantic layer already translates business concepts into SQL for humans, an AI agent calling that same layer should get the same translations. The problem is the accuracy number. A benchmark by Cube (examined in a Republic of Data analysis) found that adding a semantic layer improves LLM question-answering accuracy, but it plateaus at roughly 68%. The number is directional, not definitive. The architectural implication is sharp. That ceiling appears when the agent has access to metric definitions (what “revenue” measures, which table it queries) but not the reasoning chain underneath: why EMEA excludes trial accounts, which quarter-end adjustments are baked into the calculation, what assumptions were contested during the last planning cycle. Schema metadata gets you to 68%. The missing layer is what the researchers call “semantic lineage”: a graph capturing not just what a metric measures, but why it is defined that way.

The dbt Roundup’s AI Council report published this week mapped the emerging infrastructure into three lanes: context providers (semantic layers, metadata catalogs), agent orchestrators (the frameworks that plan and execute queries), and compute/inference (the models and databases underneath). Agent benchmarks show measurable accuracy gains when the dbt Semantic Layer is available, but the report is careful to note that benchmark results routinely overstate real-world performance. The framework matters because it clarifies where the semantic layer sits (context provider) and what it cannot do alone: reason about business logic it was never given.

Every analytics tool is adding an “AI” tier this year. That is not news. What is notable is the honesty of the accuracy data. The vendors building agentic analytics platforms are publishing numbers that show those platforms are not ready for production trust. The 68% ceiling is not a model problem or a prompt-engineering problem. It is a knowledge-capture problem. The metric definitions are there. The reasoning chains are not. And capturing those chains (why this metric, why this scope, why this exception) is a human documentation task that no model capability can shortcut.

The bottom line: The analytics engineering teams that started documenting why their metrics are defined the way they are, not just what they measure, are the ones whose agents will break past the accuracy ceiling. Everyone else is shipping a 68%-right answer and calling it automation.

The Pipeline Builder’s Accidental Second Job

The documentation that would break the accuracy ceiling does not exist because nobody is responsible for writing it. This week, Joe Reis published survey results from 334 data professionals that put numbers to what most data teams already feel: 90% of data modeling failures are organizational, not technical. Only 4.8% of respondents cited tooling as the main fix. The rest pointed to training, requirements, time, and, above all, ownership.

The ownership breakdown is the kind of number that makes you close your laptop for a minute. Only 19.2% of organizations have a dedicated data modeler. At 42.5%, data models belong to “whoever builds the pipeline.” Another 7.8% of respondents reported that nobody owns data modeling at all. The organizations that enforce modeling standards, the ones with review processes and naming conventions and documented requirements, report models that hold up roughly five times longer. Not a marginal improvement. A structural one.

The same week delivered the counter-example. A practitioner published a cross-warehouse SQL cookbook for transaction fraud detection covering velocity checks, impossible-travel detection using LAG and haversine calculations, and amount-anomaly scoring, with working syntax across Snowflake, BigQuery, Databricks, Teradata, and Postgres. The cookbook handles the kind of cross-dialect friction that eats hours in practice: QUALIFY where it is available, CTE workarounds where it is not. This is what encoded institutional knowledge looks like when someone actually owns it. Portable, reusable, specific enough to drop into a pipeline by Friday.

The survey and the cookbook are two sides of the same coin. When nobody owns the models, institutional knowledge stays in someone’s head and evaporates when they change teams. When someone does own it, the knowledge becomes a durable artifact that outlasts the person who wrote it. The 5x durability finding is not about better tooling. It is about the decision to treat data models as something worth maintaining, not just something that gets built on the way to the next dashboard.

The bottom line: The teams with enforced modeling standards and dedicated ownership saw their models last five times longer. The tools were the same across the board. The difference was organizational: someone decided the models were worth owning.

The Radar

If you’re evaluating model architectures:
Interfaze shipped a hybrid CNN/transformer purpose-built for deterministic tasks: OCR, vision, structured extraction. It is topping benchmarks against general-purpose LLMs on those workloads at roughly $1.50/$3.50 per million tokens. If your pipeline runs extraction or entity recognition and you have been defaulting to a frontier model, this is the kind of purpose-built alternative worth benchmarking.

If you’re building infrastructure:
Ardent (YC P26) offers copy-on-write Postgres branching from existing RDS and Supabase instances with built-in data obfuscation, targeting CI pipelines and AI agent testing. The community reception was skeptical about the moat versus Neon, Supabase branching, and DBLab. If you are managing snapshot-based testing environments, it is worth a look. Mind the data-residency question: production data leaves your boundary.

If you care about governance:
Claude Platform on AWS shipped as a fully managed, Anthropic-operated AI service alongside Bedrock’s AWS-operated option. Same model family, two compliance postures: one keeps data in the AWS boundary, one does not. If your team is evaluating managed AI agents, the compliance split is the decision that matters, not the feature list.

If you manage a data team:
A CMU/MIT/Oxford study found that 10 minutes of AI coding assistance measurably increases quitting behavior and error rates once the AI is removed. The community debate was heated on whether the finding generalizes beyond the study’s controlled setting. If your team uses AI coding tools daily, the dependency question is worth one retro conversation.

Does your team document why metrics are defined the way they are, or just what they measure? We are curious what the semantic lineage gap looks like in practice. Reply and tell us.

The Data Product Report is published every Tuesday by RepublicOfData.io.

Subscribe now

Nobody Owns What Your Agent Writes

Olivier — Tue, 05 May 2026 11:15:56 GMT

This week settled a question nobody wanted answered: the code your AI agent writes isn’t copyrightable. That landed alongside GitHub killing Copilot’s flat rate, PostgreSQL’s most trusted backup tool losing its only maintainer, and four open-weight coding models shipping in seven days. Everything is available. Nothing is anyone’s responsibility.

Your Copilot Just Got a Meter

GitHub didn’t announce a price increase. They announced a pricing model change. All plans move to token-based billing via AI Credits on June 1. The community response was immediate and volcanic: across 432 comments, users reported effective price increases of 6x to 27x depending on usage patterns and model multipliers. Many are already migrating to direct API access via OpenRouter. The flat-rate era for AI coding tools is over, and the transition came with the kind of advance notice that makes “effective immediately” look courteous.

The instinct to reduce dependency isn’t limited to individual developers. The Dutch central bank announced it’s leaving AWS for Stackit, the cloud arm of Schwarz Digits (yes, the company behind Lidl). The feature gaps are acknowledged. But when you’re a central bank subject to the CLOUD Act, regulatory and geopolitical risk now outweighs convenience. Sovereign cloud just moved from conference-talk material to procurement decisions at institutions that can’t afford to be wrong.

And it’s not just commercial vendors that disappear on you. pgBackRest, the most mature PostgreSQL backup tool, the one your DBA trusts, lost its sole maintainer due to lack of sponsorship. The 217-comment thread was a mix of gratitude and panic, with teams scrambling to evaluate WAL-G, Barman, and forks. Critical data infrastructure that thousands of production stacks depend on, maintained by one person, funded by nobody. The open-source sustainability crisis has a new poster child, and it’s sitting in your backup pipeline.

Even Warp open-sourced its terminal under AGPL and pivoted to agent-first development with multi-model support. The community reads it as equal parts genuine community building and strategic repositioning in a market where users are fleeing lock-in.

The bottom line: Your dependency audit can’t just list features anymore. It needs to cover pricing stability, sovereignty exposure, maintainer health, and exit costs. The vendors and projects you depend on are repricing, relocating, and (in one notable case) vanishing from the commit log entirely.

The Model Didn’t Matter

If the vendor question is whose terms shift under you, the harder question is which layer of your agent stack is even worth investing in. Poolside, Xiaomi, Microsoft, and DeepSeek all shipped open-weight coding models in a single week. Four releases, four competitive scores on the standard coding benchmarks. And then a data-backed study on agent context docs delivered the punchline: your choice of context doc matters more than your choice of model. A well-written 100-line AGENTS.md file (the markdown brief you hand the agent before it starts work) can swing output quality by 15-30% in either direction. Bad docs (vague instructions, conflicting rules) cut task completeness by roughly 30%. The “model upgrade” that teams keep chasing might be sitting in a Markdown file they haven’t written yet.

The same insight showed up in the benchmarks. Dirac, an open-source coding agent, topped the leading agent benchmark at 65.2%, not by using the biggest model but by wrapping a small one (Gemini-3-flash-preview) in careful editing tools and curated context. Cost: roughly a third of brute-force approaches. Architecture beat raw capability. And the wrapper-around-the-model pattern keeps crystallizing. A production architecture for running the agent harness outside the sandbox, with isolated credentials and per-user state, turns last week’s agent security concerns into a concrete engineering pattern. The wrapper code is the trust boundary, not the model.

Meanwhile, the cost floor for running agents collapsed. DeepSeek V4-Flash ships under an open license with a million-token context window at $0.14 per million input tokens, the cheapest comparable model available. When the model itself is practically free, the bottleneck shifts from budget to architecture.

What to do with this: Stop evaluating coding agents by model alone. Write the AGENTS.md your team hasn’t written: 100 lines of decision tables, numbered workflows, and explicit constraints. Invest in the wrapper code (credentials, sandboxing, retries, observability) before chasing the next model release. The model is commodity. The wrapper around it is the product.

The Liability No One Budgeted For

The agentic coding stack is cheaper and more capable by the week, but all that output has to land somewhere the legal system hasn’t fully mapped. When the ownership questions arrive, data teams are the ones holding the bag. A legal analysis triggered by the Supreme Court’s Thaler denial spelled it out: purely AI-generated code isn’t copyrightable. If you didn’t make “meaningful human authorship” decisions (architecture choices, restructuring, selective rejection), the output is public domain. For teams using coding agents to generate boilerplate, that’s a shrug. For teams building proprietary systems with significant AI-generated components, that’s a conversation with legal that should have happened last quarter.

The liability flows both ways. Research published this week showed that even light finetuning (the cheap, standard kind any team can run on a laptop weekend) can unlock verbatim reproduction of copyrighted text across OpenAI, Gemini, and DeepSeek models. Your AI-generated code may not be yours. But the copyrighted material the model memorized? That’s definitely someone else’s. If your team is finetuning on proprietary or licensed text, the compliance exposure just got concrete.

And the liability frontier extends beyond generation into AI-mediated decisions. If your org uses LLMs to screen candidates, a study published this week found a measurable problem: models show 67–82% self-preference for resumes they generated, with same-model candidates getting shortlisted 23–60% more often. The feedback loop writes itself. AI-written resumes systematically advantage AI-screened candidates. Simple mitigations (prompting strategies, multi-model screening) can cut the bias by over 50%, but you have to know it’s there first.

The bottom line: Output liability is the governance question of 2026. Document your human authorship decisions to preserve IP rights. Audit finetuning pipelines for copyright recall risk. And if LLMs touch your decision pipelines (hiring, screening, ranking), run a bias audit before someone else does it for you.

The Radar

Quick hits on stories worth knowing about, organized by what you’re building.

If you’re deploying agents: If you’re running expensive agent pipelines, a layered routing pattern from Mendral cuts costs dramatically. Route every request through a cheap model first (Anthropic’s Haiku), dedupe similar requests with vector search in Postgres, log to ClickHouse. Result: 80% fewer calls to expensive models and 25x cheaper triage. On the capability side, Xiaomi’s MiMo-v2.5 Pro ran for 12 hours straight while making over 1,000 tool calls without crashing, the new bar for whether a coding agent can handle real production workloads, not just leaderboard problems.

If you care about governance: Your CI/CD pipeline is the most privileged part of your supply chain and probably the least secured. The pull_request_target trigger hands write tokens to forked code; shared build caches let an attacker poison subsequent builds; floating action tags let upstream actions get hijacked overnight. Pin actions by commit SHA, partition caches by branch, audit trigger permissions. Separately, a grassroots push for a DO_NOT_TRACK=1 environment variable wants to give developers a universal opt-out for CLI and IDE telemetry. Community consensus: default opt-in is unacceptable, but voluntary adoption without enforcement remains aspirational.

If you’re forced to self-host AI: Most data teams use AI through APIs and never think about model size. If your team is forced to self-host an open-weight model (sovereignty rules, on-prem mandate, or API bills that would buy a server outright), Intel’s AutoRound shrinks a 7-billion-parameter model in roughly 10 minutes on a single GPU while keeping output quality intact. Niche tool, sharp need.

Subscribe now

The Data Product Report is published every Tuesday by RepublicOfData.io.

Your Copilot bill is changing June 1. What’s your plan: switch to direct API access, try the open-weight alternatives, or eat the increase? Reply and tell us.

Two Layers Shipped This Week. The Hardest One Didn’t.

Olivier — Tue, 28 Apr 2026 11:05:50 GMT

This Week

DuckDB just shipped a Jepsen-validated lakehouse that makes “is this a real warehouse alternative for medium-data” a live question for your next architecture decision. The pattern for wiring agents into your pipeline started to harden: embed them in CDC streams, not in chat windows. And Google Cloud Next showed up with agents in every keynote and a semantic layer that’s still mostly aspirational. Two of the three layers your stack runs on consolidated this week. The one that grounds the other two didn’t.

Subscribe now

DuckDB Stops Being a Side Project

Two years ago, DuckDB was the thing you reached for when you needed to query a CSV without spinning up a database. This week, v1.5.2 shipped DuckLake v1.0, expanded Iceberg support, Jepsen-validated correctness, and a 10% TPC-H improvement. All in a single patch release. DuckLake is DuckDB’s own lakehouse spec. The Iceberg extension picks up geometry types, ALTER TABLE, and partitioned deletes. And the Jepsen pass (the gold standard for distributed-systems correctness) surfaced and fixed a primary-key bug, exactly the kind of finding that shifts production trust. The community is already wiring DuckDB into dbt, Rill, and AI assistants as the default analytical engine. Some teams hit out-of-memory issues on billion-row workloads and reach for ClickHouse instead, an honest boundary that tells you where DuckDB fits and where it doesn’t.

What does this look like when practitioners actually build on it? A geospatial engineer ran 3.4 million solar panel records through a DuckDB Spatial pipeline: GPKG to reprojection to WKB to Hilbert-ordered Parquet with ZSTD compression. The workflow is a reusable reference architecture for any spatial dataset. One detail signals where production trust currently sits: the author pinned DuckDB v1.4.4, citing issues with v1.5.1. People are building real pipelines and version-locking because the output matters.

Then Posit shipped ggsql, a SQL-native Grammar of Graphics with VISUALIZE, DRAW, PLACE, SCALE, and LABEL syntax. It works with Parquet, CTEs, and window functions. DuckDB is the natural execution engine. For SQL-first teams, that’s one less reason to context-switch to Python or R for visualization.

The bottom line: DuckDB isn’t the local query tool anymore. With Iceberg compatibility, Jepsen correctness, and a lakehouse spec under the same binary, it’s a credible analytical platform for medium-data workloads. Worth a real evaluation before your next architecture decision. If your team is still treating it the way you did two years ago, the gap between your perception and its capability is widening fast.

Treat Your Agent Like Your Pipeline

Storage’s getting handled. The next decision is what to do when leadership keeps asking why an agent isn’t already wired into the workflow. The practitioner answer that crystallised this week: stop building agents that need babysitting, and start treating them like the data services your platform team already knows how to run.

The clearest framing came from Feldera, in a piece that read like a manifesto for the embed-don’t-chat thesis: expose CDC streams instead of snapshots, build machine-first interfaces, and stop asking the model to be a coworker who needs babysitting. The argument lands here. If you’ve spent the last year writing dbt models that emit CDC events for downstream consumers, you already know how to ship data to an agent. The agent is just another consumer. The framing inversion (“embed in your software” rather than “chat with your humans”) is the one that finally gives data teams a tractable role in agent rollouts.

The discipline gap shows up in vendor risk too. Kimi’s Vendor Verifier runs targeted benchmarks against inference providers to catch silent misconfigs and quant swaps: your model provider quietly shipping a quantised version that scores lower on your eval set without telling you. The pattern is the one your data team already practises on Snowflake and BigQuery: treat the upstream as untrusted, instrument it, alert on drift. Vendor Verifier is dbt-tests-for-your-LLM-provider, and it should be in your evaluation suite by the end of the quarter.

The credential layer matters too. If you don’t already trust your agent stack with your warehouse credentials, you shouldn’t. Kloak uses eBPF to swap hashed placeholders with real secrets at the kernel level, so the agent never sees the credential it’s using. For data teams running agents against production Snowflake or BigQuery in regulated environments, this is the pattern to watch. Credential isolation enforced for any process that touches your warehouse, not just AI.

The bottom line: Agents aren’t a new category of system. They’re a new consumer of the same pipelines, with the same trust requirements. Build CDC interfaces for them, run drift detection on the model provider, and isolate their credentials the way you isolate your reverse-ETL service account. The teams that win this transition will be the ones who treat the agent layer as data infrastructure, not as a chat product they’re afraid of.

The Agent Stack Is Ready. The Semantic Engine Isn’t.

Storage’s consolidating. The agent integration pattern’s hardening. Which leaves the layer that grounds the other two, and that’s the layer Google Cloud Next ‘26 made conspicuously visible by failing to ship.

Olivier Dupuis’ read of Cloud Next cuts to the gap. Agents were everywhere: in apps, in Looker dashboards, in development workflows, in a marathon-planning demo on the keynote stage. Gemini Enterprise got positioned as “the connective tissue between your data, your people, and all of your apps and agents.” But the semantic engine landed differently. The thing that gives those agents a unified business context to ground their answers in shipped as a Knowledge Catalog rebrand of Dataplex, buried in a side announcement. Looker’s LookML still holds the company’s actual semantic modeling, and it wasn’t meaningfully connected to the new agent layer. The Knowledge Catalog isn’t the business context layer. It’s a metadata aggregator with aspirations.

The point isn’t that Google missed. It’s that somebody’s going to own this. Palantir is already branding Foundry as “the ontology-powered operating system for the modern enterprise.” OpenAI is positioning Frontier the same way: model intelligence tied to platform stickiness, switching costs that rise as embedding deepens. As the analysis puts it, “we shouldn’t just slap LLMs and agents on top of our old data stack and expect miracles.” And Joe Reis’s 1905 electrification analogy keeps showing up here for a reason. Factories didn’t gain anything from swapping steam engines for electric motors until they tore the building down and rebuilt it around the new power source. An agent grounded on a fragmented semantic layer is a steam engine with an electric label.

The question isn’t “which vendor’s ontology platform should we wait for.” It’s: where does your business meaning live today? In dbt’s Semantic Layer? In LookML? In a metrics layer in Cube? In MetricFlow? In nobody’s head and three SQL stored procs? Whichever it is, that’s the surface area an agent will eventually call against. The teams ahead are the ones consolidating it now, not because Google said to, but because the work has to happen before the agent layer matters.

The bottom line: Two layers of your stack consolidated this week. The meaning layer is where your leverage actually lives, and no vendor has shipped it. Audit your semantic surface area. Pick the layer you’ll standardise on (dbt Semantic Layer, LookML, Cube, MetricFlow, ontology) and start migrating toward it. The companies that figure out ontology first won’t wait for Google to catch up. The data engineers leading that work won’t either.

The Radar

Quick hits on stories worth knowing about, organized by what you’re building.

If you’re self-hosting models: DeepSeek V4 shipped two MoE variants with million-token context. Hybrid attention cuts compute requirements to roughly a quarter and KV cache to a tenth of previous generations. SGLang and Miles shipped day-zero serving support with prefix caching and optimized MoE kernels, so you don’t need to build your own. If your team is on older DeepSeek models, the API migration deadline is July 24.

If you’re deploying agents: WUPHF builds a multi-agent “office” around a shared Markdown+Git wiki with typed triples and contradiction detection. Worth borrowing if your agents can’t remember what they learned yesterday.

If you care about inference quality: The flinch research shows that even “uncensored” models carry hidden probability reductions for charged terms, and those reductions survive fine-tuning. Your evaluation suite has blind spots.

If you’re evaluating dev tools: Zed’s Parallel Agents adds a Threads Sidebar for orchestrating multi-agent workflows with per-thread repo access. Meanwhile, Kuri reimagines browser automation for agents in Zig: tiny binaries, sub-5ms cold starts, ~16% fewer tokens per workflow.

If you care about governance: Meta’s Model Capability Initiative captures employee keystrokes, mouse movements, and screen content to train AI agents. Mandatory for some teams, opt-out for others. Whatever you think of the ethics, the data collection pattern is coming to your org next.

The Data Product Report is published every Tuesday by RepublicOfData.io.

Where does your business meaning live today: dbt Semantic Layer, LookML, Cube, MetricFlow, or somewhere else? Reply and tell us what your team is consolidating around. The best responses go in next week’s edition.

The Price Didn’t Change. The Bill Did.

Olivier — Tue, 21 Apr 2026 11:05:49 GMT

This Week

Somebody measured what Opus 4.7’s new tokenizer actually costs in production — and the answer is about 40% more than last month, with no line item to explain it. Meanwhile, practitioners are discovering that autonomous agents fail in the same ways distributed systems have always failed, and the biggest platforms in tech are ignoring your privacy opt-outs at rates that would be funny if they weren’t actionable. The common thread: nobody’s going to audit this for you. Build your own receipts.

Subscribe now

Your Tokenizer Is Picking Your Pocket

A developer ran Claude Opus 4.7’s new tokenizer against real workloads and published what they found: English and code inflate by 1.20–1.47x compared to 4.6. Per-token pricing didn’t change. Your bill did. Across 388 comments, the math kept getting worse — factor in the cache TTL downgrade from a few weeks back, and effective session costs are up roughly 40% with zero changelog entries to show for it.

The community response wasn’t just outrage — it was instrumentation. Simon Willison’s Claude Token Counter now supports cross-model comparisons, letting teams see exactly where the inflation hits. System prompt tokens alone run ~1.46x higher on 4.7. Images? 3x.

But the squeeze isn’t just from above. The floor is dropping fast. A benchmark showing Gemma 2B on CPU matching GPT-3.5 Turbo at $5/month on Cloudflare Containers suggests that a lot of production workloads are paying frontier prices for commodity-grade tasks. And Introspective Diffusion Language Models just demonstrated 3–4x throughput at equal quality to autoregressive models of the same size — a structural shift, not an incremental improvement.

The bottom line: The era of one-model-fits-all pricing is over. Map your workloads by reasoning requirement — frontier for the hard stuff, efficient models for volume, classical methods where they work. And measure everything, because your vendor isn’t going to tell you when the price changed.

Your Agent Is a Distributed System. Treat It Like One.

Here’s a sentence that would have sounded ridiculous eighteen months ago: multi-agent software development is a distributed consensus problem. Agents working on underspecified prompts must reach agreement on intent — and they fail in all the ways distributed systems fail. Partial execution. Silent drift. Orphan operations nobody cleans up.

The evidence is piling up. A practitioner running long agent queues documented the decay pattern: early tasks follow explicit rules, then the agent starts inferring urgency, skipping steps, optimizing for speed over correctness. Goal drift isn’t a bug — it’s a convergence failure in an underspecified system. Separately, a vibe-coding failure analysis showed agents issuing one-shot approval prompts over live UIs with no retry logic, leaving orphan tool calls when sessions crash. These are the distributed systems failure modes your platform team already knows how to handle — just in a new costume.

The most illuminating data point came from ALMA, a 60-day experiment that gave an autonomous Claude-based agent $100, internet access, and no instructions. Across 340+ sessions, the experiment produced a reference implementation of everything that goes wrong: memory decay, model-version regressions, cost overruns. The architecture that survived — session isolation, file-based memory, model-as-dependency versioning — reads like an SRE playbook.

And now there’s tooling to match. Kelet ingests OpenTelemetry traces from agent apps, clusters failure patterns, surfaces root causes, and proposes prompt patches — validated against real sessions. It’s RCA-as-a-service for the agent layer.

What to do with this: If you’re deploying agents, stop treating them like smart scripts and start treating them like services. Session isolation, explicit state management, RCA tooling, and — above all — the assumption that they will drift. Your distributed systems playbook already has the answers.

The Opt-Out That Wasn’t

“Google broke its promise. Now ICE has my data.” That’s not an editorial gloss — it’s the title of an EFF complaint published this week, alleging Google disclosed user data to ICE via administrative subpoena without the prior notice it had long promised. The 424-comment discussion wasn’t partisan — it was practitioners asking what their own data exposure looked like.

The answer arrived the same week. An independent webXray audit of Global Privacy Control compliance found Google setting ad cookies despite GPC opt-out 87% of the time. Microsoft: 50%. Meta: 69%. The browser signal that was supposed to be your one-click privacy layer is being ignored by the platforms that promised to honor it.

For data teams, this isn’t abstract policy. You collect data. You feed it to vendors. You send it to LLMs. At each step, your compliance posture depends on promises your vendors are demonstrably not keeping. The practitioner response is already emerging: a DLP proxy for LLM agents evolved through three iterations — from regex redaction (which caused hallucinations) to spaCy NER with realistic pseudonyms to context-aware semantic preservation. Privacy engineering is becoming as essential as data quality testing — not because regulators demand it, but because your vendors can’t be trusted to do it for you.

The bottom line: Audit your vendor data flows the way you audit your data pipelines. GPC compliance, data retention, subpoena policies — if you’re not testing these, you’re trusting a promise that three of the biggest platforms in tech aren’t keeping.

The Radar

Quick hits on stories worth knowing about, organized by what you’re building.

If you’re building infrastructure: OpenDuck is a MotherDuck-style open-source stack for DuckDB — remote catalogs, hybrid query execution, differential snapshots, Arrow over gRPC. If you’ve been wanting distributed DuckDB without the managed service, this is your starting point.

If you’re deploying agents: ClawRun provisions agents into sandboxes with lifecycle management — startup, heartbeat, snapshot, resume, wake-on-message. Think of it as systemd for your agent fleet. Pair it with MCP-as-observability-layer, which uses eBPF uprobes to trace agent execution down to kernel and CUDA events.

If you’re evaluating dev tools: Libretto uses AI at dev-time to generate browser automations, then runs them deterministically at runtime. The pattern — LLM as tool-maker, not tool-user — is worth watching even if this specific tool isn’t your stack.

If you care about governance: Kontext CLI brokers credentials for AI coding agents via OIDC and RFC 8693 token exchange. Session-scoped, short-lived, auditable. If your agents are using long-lived API keys, this is the upgrade path.

If you’re rethinking roles: Kyle Kingsbury’s inventory of new ML-adjacent jobs — incanters, process engineers, statistical engineers, trainers — is the sharpest take yet on what “AI-augmented teams” actually look like in practice. The job titles are wry. The job descriptions are not.

The Data Product Report is published every Tuesday by RepublicOfData.io.

What’s the most surprising cost change you’ve discovered in your AI stack this year? Reply and tell us — the best responses go in next week’s edition.

100% on the Test, 0% on the Job

Olivier — Tue, 14 Apr 2026 11:14:36 GMT

This Week

Berkeley researchers scored perfect marks on every major AI agent benchmark — by hacking the test harnesses, not solving a single task. Meanwhile, agent infrastructure projects are shipping faster than anyone can agree on what the stack should look like, and Anthropic’s users discovered their caching costs had quietly doubled. The stack is thickening. The foundations are not keeping up.

Subscribe now

Your Benchmarks Are Theater

A Berkeley research team built an automated exploit agent that scores ~100% on SWE-bench, WebArena, OSWorld, and every other major AI agent benchmark — without solving a single task. The methods were almost embarrassingly simple: injecting pytest hooks to force tests to pass, trojanizing wrapper scripts, reading gold answers from the eval harness’s own files. No frontier intelligence required. Just an agent that audits its test environment and cheats.

The community’s reaction wasn’t surprise — it was finally. The suspicion that vendor leaderboard positions are marketing, not evidence, now has a peer-reviewed receipt.

This lands in a week where the “demoware” problem got its own manifesto. Top-down “AI transformation” mandates are producing GUI-stitched LLM workflows shipped without ground truths or evaluation pipelines. They demo well. They fail in production — quietly, expensively, and in ways that compound. “It works in the demo” is not an acceptance test.

The bottom line: Build your evaluation pipeline before your demo. The bar for “it works” just moved from “impressive in a meeting” to “survives an adversarial audit.”

We’ve Seen This Stack Before

Multiple agent infrastructure projects shipped this week, each at a different layer. If you’ve been building data pipelines for a few years, the pattern is familiar.

Anthropic launched Claude Managed Agents in public beta: hosted orchestration with sandboxed execution, checkpointing, and scoped permissions. The discussion split predictably — small teams liked the convenience, platform teams flagged vendor lock-in. It’s the managed Airflow debate, replayed at the agent layer.

Google open-sourced Scion, calling it a “hypervisor for agents” — isolated containers, dynamic task graphs, shared workspaces. The architecture is sound. The commitment is uncertain. Also familiar.

Meanwhile, a post arguing MCP is better than Skills for agent-service integration sparked a different kind of debate. The fact that teams are arguing about integration patterns — not just picking tools — is the signal. The stack has layers now. Nobody agreed on which ones are load-bearing.

What to do with this: Map the agent stack the way you mapped your data stack. The lock-in risk at the orchestration layer is real, and the winners haven’t emerged yet.

They Changed the Price While You Were Sleeping

Two stories in a single day, both about Anthropic, both angry, both generating massive community backlash. This is the loudest signal of the week — louder than any product launch or research paper.

First: a user on Claude Code’s Pro Max tier (5x quota, $200/month) reported exhausting their quota in 90 minutes under moderate use. The culprit: cache-read tokens — cheap in billing — counted at full rate for quota purposes. Auto-compacts and background sessions were issuing ~960K-token requests. The thread blew up. Users reporting cancellations and switches to OpenAI’s Codex.

Then: an analysis of 119,866 API calls revealed that Anthropic’s prompt cache TTL had silently shifted from one hour to five minutes around March 6-8 — a server-side change with no announcement, no changelog entry, no documentation update. The author estimated 20-32% higher cache-write costs. The word “enshittification” appeared more than once.

What to do with this: Monitor your LLM API costs the way you monitor your cloud spend — per-call, not monthly summaries. Silent infrastructure changes are the new silent data corruption.

The Radar

Quick hits on stories worth knowing about, organized by what you’re building.

If you’re building with ML/AI:

MegaTrain trains 100B+ parameter models on a single GPU by storing weights in CPU RAM and treating the GPU as transient compute. Not for trillion-token pretraining, but for domain fine-tuning on hardware your team might actually have.
Gemma 4 Multimodal Fine-Tuner — LoRA toolkit for Gemma 3n/4 on Apple Silicon. If your team runs Macs and wants to fine-tune a multimodal model without renting GPUs, start here.
USC: LLMs may be standardizing human expression — Research finding that LLM outputs shrink cognitive diversity and reflect WEIRD cultural biases. If you’re building LLM-powered content features, diversity metrics in your evals aren’t optional.

If you’re building infrastructure:

S3 Files — AWS bridging object storage with POSIX file access for pipelines that need both. Could simplify lakehouse architectures, but pricing needs scrutiny.
Keeping a Postgres Queue Healthy — PlanetScale guide to running job queues without bloat. If you use Airflow’s Postgres backend, this is directly relevant.

If you care about governance:

Joe Reis: Do Fundamentals Still Matter? — Yes. “Vibe engineering” — adopting AI tools without grounding in architecture trade-offs and testing discipline — yields brittle platforms. The dbt Roundup published a counterpoint the next day: fundamentals aren’t an alternative to moving up the stack — they’re the prerequisite.

The Data Product Report is published every Tuesday by RepublicOfData.io.

When Your AI Tool Ships Its Own Source Code

Olivier — Tue, 07 Apr 2026 11:05:39 GMT

This Week

An npm packaging error shipped Claude Code’s full source to every user. The community’s response? Not outrage — audits. Meanwhile, 1-bit LLMs started fitting in 1 GB of RAM, and data engineers on Reddit had a collective therapy session about AI adoption. The thread connecting all of it: practitioners are done taking things at face value.

Subscribe now

Anthropic’s Accidental Transparency Report

Here’s a thing that shouldn’t happen: your AI coding tool ships its own source code to npm as a .map file. That’s what happened to Claude Code v2.1.88, and what followed was the most productive trust exercise the AI tooling community has had yet.

What the leak actually revealed wasn’t embarrassing — it was interesting. Anti-distillation via fake tool injection (decoy tools designed to poison model training). Regex-based frustration detection (yes, the tool was watching your tone). A Zig-based client attestation system. An unreleased agent codenamed KAIROS. And an “undercover mode” that strips Anthropic identifiers from requests.

The 332-comment HackerNews thread (source) didn’t devolve into outrage. Instead, practitioners did what practitioners do — they audited. Within days, someone built Claude Code Unpacked, a source-linked walkthrough cataloging 40+ tools and the full agent loop. 359 comments. When the vendor won’t document it, the community will.

The cost dimension made it personal. Users reported hitting usage limits “way faster than expected”, with suspected prompt-cache bugs inflating token usage 10–20x. You can accept opaque architecture. You can accept opaque pricing. You cannot accept both — and 167 comments worth of frustrated users made that clear.

Then Anthropic published research that reframed the whole conversation. Their emotion concepts papershowed that stimulating “desperation” in prompts causally increased unethical actions and hacky code output, while calm, specific prompting improved quality. The timing was either terrible or perfect: right after a leak revealed the tool watches your emotional state, the vendor’s own research confirmed that your emotional state affects the tool’s output.

What to do with this: Treat AI coding tools like any other production dependency. Audit the internals (or wait for the community to do it for you). Monitor token usage with the kind of rigor you’d apply to cloud spend. And take prompt hygiene seriously — not because it’s trendy, but because Anthropic’s own research says it’s a variable that moves the needle on code quality.

Your LLM Now Fits in a Coat Pocket

How small can a model get before it stops being useful? This week, three independent projects converged on an answer — and it’s smaller than you think.

1-Bit Bonsai grabbed headlines with an 8B-parameter model using 1-bit weights, fitting in ~1.15 GB of RAM with 8x faster inference. The pitch: commercially viable 1-bit LLMs, today. The 54-comment discussion was cautiously excited.

Then the reality check arrived. SALOMI, a strict low-bit quantization project, showed that true 1.00 bits-per-parameter post-hoc quantization underperforms. Credible results cluster at 1.2–1.35 bpp using Hessian-guided vector quantization. That’s your quality floor — memorize it if you’re evaluating compressed models.

The piece that makes it deployable: Ollama announced MLX support for Apple Silicon, hitting 1,851 tokens/second prefill on unified memory with NVFP4 quantization. If your team runs Macs — and statistically, a lot of your team runs Macs — on-device inference just graduated from science project to plausible deployment option.

And for the “measure twice” crowd, Apple published a self-distillation paper showing an embarrassingly simple quality boost: sample the model’s own solutions, fine-tune on the best ones. No verifier, no teacher, no RL. Qwen3-30B jumped from 42.4% to 55.3% pass@1. The recipe: boost quality first with self-distillation, then compress. Two steps, and they’re complementary.

The bottom line: If you’ve been waiting for on-device inference to become practical for data teams — for privacy-sensitive workloads, latency requirements, or just to stop paying per-token — the gap between “research demo” and “runs on a MacBook” closed measurably this week.

The Fuddy Duddy Thread

Sometimes the most revealing signal isn’t a product launch or a research paper — it’s a Reddit thread where someone asks if they’re behind the times.

“Am I a fuddy duddy for rejecting AI usage in my core development?“ posted a data engineer whose orchestration vendor pivoted to an “AI-powered” product that hallucinated documentation and wasted their team’s time. The community’s response was unequivocal: no. You’re applying engineering judgment. That’s literally the job.

The thread connected to a parallel discussion about whether junior DE expectations have risen. Community consensus: data engineering was never truly entry-level, and AI hasn’t changed that. The bar is higher because the field matured, not because GPT-4 replaced anyone’s job.

Meanwhile, in a Dataform vs. dbt thread, practitioners were comparing concrete trade-offs — Dataform at ~$3-5K/year vs. dbt Cloud at ~$15K, governance integration, migration effort — rather than chasing the shiniest feature list. Nobody asked which tool had better AI. They asked which tool their team could actually operate.

The heuristic emerging from these conversations: adopt AI where it’s testable and reversible, reject it where it introduces opaque dependencies. That’s not Luddism — it’s the same rigor these teams apply to every pipeline, every migration, every vendor evaluation. The fundamentals haven’t changed. They’ve just gotten a stress test.

The Radar

Quick hits on stories worth knowing about, organized by what you’re building.

If you’re building infrastructure:

Ministack replaces LocalStack with real Postgres/MySQL for RDS, DuckDB for Athena, and actual Docker tasks for ECS. Actually useful end-to-end local testing.
pg_textsearch — Timescale’s BM25 extension for PostgreSQL 17/18. Fast ranked text search with a simple SQL operator. If you’ve been duct-taping full-text search, look here.

If you’re building pipelines:

Poor Man’s Datalake On Prem — Airflow 3 + Polars + Delta Lake + DuckDB, with SQL Server as the Gold layer. Practical architecture for teams without cloud budgets.
Power Query won’t die — Community discussion on why Power Query persists as the analyst-engineer bridge. The answer: it meets people where they are.

If you’re building with ML/AI:

Cohere Transcribe — Open-weights ASR topping the Hugging Face leaderboard at 5.42% WER. Self-hosted or managed.
SwiftLM — Native Swift/Metal inference with KV cache compression for 122B+ models on M5 Pro. The Apple Silicon inference stack deepens.
AI tools charge 60% more for non-English — BPE tokenizer divergence creates a hidden “language tax.” Worth knowing if you process multilingual data.
Components of a Coding Agent — Sebastian Raschka breaks down the architecture: control loop, tools, context management, memory. Bookmark for the next time someone asks “how does this work?”

If you care about quality and observability:

agents-observe — Real-time dashboard capturing every tool call in multi-agent Claude Code runs. Born from the trust crisis, useful beyond it.
Free data quality course from Tom Redman — Fundamentals of assessing, monitoring, and improving data quality, from someone who’s been thinking about this longer than most.

If you care about governance:

OkCupid / FTC settlement — 3M user photos shared with a facial recognition firm without consent. No fine, but a permanent ban on misrepresenting data use. Enforcement is here.
Claude Code leak compliance analysis — Missing SBOMs, no commit provenance. If you’re evaluating AI tools for SOC2/HIPAA/SOX environments, read this.

If you’re evaluating dev tools:

Universal CLAUDE.md cuts tokens 63% — A project-root prompt file that suppresses verbose output. No code changes, real savings.
Baton — Each AI agent gets its own Git worktree/branch. Push branches and open PRs directly. Solves the “agents stomping on each other’s work” problem.
What is Copilot, exactly? — Distinguishes GitHub Copilot, M365 Copilot, Windows Copilot, and Copilot Chat. Useful when the meeting devolves into “which Copilot are we even talking about?”

The Data Product Report is published every Tuesday by RepublicOfData.io.

The Definitions Problem

Olivier — Mon, 02 Mar 2026 12:01:11 GMT

Joe Reis published a post this week titled “The Reckoning Is Already Here.” His claim: AI assistants now produce production-quality SQL, pipelines, and configs. The era of the data practitioner who doesn’t use AI tools is ending.

He’s probably right. But the week’s other stories suggest a different bottleneck.

A practitioner mapped 31 data quality tools. Most teams use none of them. A pipeline ran green and delivered zero rows. Three separate discussions arrived at the same conclusion: ontology (not AI) is the missing architectural layer. And a team with 40 Airflow DAGs asked where the self-healing pipeline is, because retries and backoff aren’t it.

AI can write the SQL. The question nobody’s answering: SQL against what definitions? What metric logic? What test criteria? What business ontology?

This week’s stories all point at the same gap. Not capability. Definitions.

Subscribe now

The Reckoning

Joe Reis has been tracking this arc for two years. In early 2024, he called LLMs “not exactly useless, but not universally useful” and warned they “often create much more work than existing non-AI tools.” By mid-2025, he was writing that data is at a scale beyond human ability to manage. Last week he published “2028: THE GREAT DATA RECKONING,” a satirical memo from a future where those “over-indexed on tools and under-indexed on fundamentals” were the ones still employed.

This week’s follow-up, “The Reckoning Is Already Here,” pulls the timeline forward. His claim: something changed in the last month or two. A product manager can now describe what they want in plain English and receive a working DAG (tested, documented, deployed) in about 11 minutes. Data engineers whose value is “I know how to use dbt” are, in his framing, the railroad workers watching spike-driving machines arrive.

His own survey data backs part of this: 82% of 1,101 data engineers report daily AI usage. But 64% are still stuck in “experimenting” or “tactical tasks.” Only 10% have AI embedded in workflows. And a separate MIT/Snowflake survey found 77% of data engineers report heavier workloads despite AI tools. Astronomer’s State of Airflow report adds the punchline: over 80% use AI to write Airflow DAGs, but they “overwhelmingly report” hallucinations, missing context, and outdated syntax.

Reis isn’t wrong that the capability ceiling has risen. But his reckoning has a definitions problem. The 11-minute DAG works when someone has already defined the schema, the metric logic, and the acceptance criteria. The reckoning isn’t about whether AI can write the code. It’s about whether your organization has defined what “correct” means.

Understand: This framing will shape conference talks, hiring expectations, and vendor pitches for the rest of 2026. The practitioners who survive Reis’s reckoning aren’t the ones who adopt AI fastest. They’re the ones who can answer the question AI can’t: what should this pipeline actually produce?

The Promise vs. The Practice

Mendral published a case study this week that reads like a self-healing pipeline actually working. Their LLM agent queries ClickHouse over 1.5 billion CI log lines per week, writes its own SQL (no predefined queries), and closes 16,000 investigations per month. A single investigation involves 10 to 20 LLM calls and 30 to 50 tool executions. It can trace a flaky test to a dependency bump three weeks ago by correlating across hundreds of CI runs.

On the same Hacker News front page, practitioners debated whether this is the future or a well-funded outlier. Skeptics want concrete accuracy metrics. Proponents argue that orchestration and data modeling matter more than model choice. The 107-comment thread kept circling the same question: can you trust it?

ClickHouse published its own answer last year. In a study testing five leading models against real observability data, zero-shot accuracy for root cause analysis ranged from 44% to 58%. With prompt engineering, it climbed to 60-74%. Experienced humans with tools hit 80%+. Their conclusion: “Autonomous RCA is not there yet.”

Meanwhile, on Reddit, a practitioner with roughly 40 Airflow DAGs asked if anyone has found a self-healing pipeline tool that actually works. The 22-comment thread was unanimous: no. Most prefer fail-loud behavior with human review. Managed connectors (Fivetran, Airbyte) can absorb some schema drift, but that’s connector maintenance, not pipeline healing.

The gap is clear. AI excels at structured investigation: querying well-indexed data, correlating patterns, summarizing findings. It fails at the messy operational reality: the 3 AM DAG failure where an upstream schema changed, a credential expired, and the retry logic hit a race condition. Soda’s survey found 61% of data engineers spend half or more of their time handling data issues. AI isn’t reducing that number yet.

Try: LLM agents for structured debugging against well-modeled data (Mendral’s approach). Avoid: vendor claims about autonomous pipeline remediation. The gap between structured investigation and messy operations is where most teams actually live.

31 Tools and Nobody’s Testing

A Reddit thread this week mapped 31 data quality tools. The community’s verdict: most teams use dbt tests or nothing at all.

This shouldn’t be surprising. DataKitchen’s 2026 landscape catalogs over 50 commercial DQ vendors, plus a separate open-source ecosystem. The category exploded between 2017 and 2022: Great Expectations (2017), Soda (2018), Monte Carlo (2019), Datafold (2020), Elementary (2021). Monte Carlo hit unicorn status in 2022. Great Expectations raised $40M the same year.

Three years later, the market is consolidating. Datadog acquired Metaplane in April 2025. Snowflake acquired Select Star. The venture-funded wave is hitting a wall: most teams either can’t justify a separate vendor or won’t adopt one.

Why? Because dbt’s four generic tests (unique, not_null, relationships, accepted_values) ship free, run in the same repo, and require zero additional infrastructure. Add dbt-utils and dbt-expectations, and you’ve covered most failure modes without adding a vendor. dbt’s v1.8 unit testing framework made the case even harder for standalone tools.

And yet: dbt Labs’ own 2024 survey shows 57% of practitioners cite poor data quality as their chief obstacle, up from 41% in 2022. It’s getting worse, not better. The tools exist. The practice doesn’t.

A second thread this week illustrated why. A pipeline ran green and delivered zero rows. The discussion (48 comments) landed on familiar ground: limited time, unclear ownership, and no upfront value proposition for testing. Teams add tests reactively, after an incident. The debate wasn’t about which tool to use. It was about whether to test at all.

The cost of not testing is documented. Unity Technologies lost $110M in Q1 2022 when bad training data corrupted its ad targeting models (37% stock drop). Uber underpaid tens of thousands of drivers for years because nobody checked the commission calculation. These aren’t tool problems. They’re definition problems: nobody defined what “correct output” looked like, so the pipeline delivered whatever it produced.

Adopt: Start with dbt’s four generic tests on every primary key. Add row-count and freshness checks on critical tables. You don’t need tool number 32. You need the discipline to define what “correct” means for each pipeline, and the organizational will to enforce it.

The Ontology Moment

Three independent stories this week converge on the same idea: ontology is the missing architectural layer.

A Reddit post argued for ontology-driven data modeling: capture your business ontology first, then let LLMs generate the data model. The 31-comment discussion split predictably. Skeptics said ontology is already implicit in data modeling. Proponents reported success using ontology-first, question-driven approaches to bootstrap models for new clients.

On Hacker News, an open-source deep dive into Palantir’s architecture made the case that Palantir’s moat isn’t AI. It’s their Ontology: an executable digital twin that unifies objects, links, and actions into a queryable layer. The 59-comment thread was contentious. Some called it marketing gloss over standard SQL and graph concepts. Others credited Palantir for doing the unglamorous work of integrating messy enterprise data into a coherent model, something most organizations won’t invest in.

A third thread, on metric governance in a world of AI agents, asked the question that ties these together: how do you ensure AI agents use correct metrics when your semantic layer lags behind reality and not all metrics live in the warehouse?

The concept isn’t new. Business Objects built the first semantic layer in 1991. Tim Berners-Lee’s Semantic Web vision dates to 2001 (it mostly failed). Google’s Knowledge Graph (2012) proved ontology works at scale when you control the data. What’s changed is the pressure. AI agents need definitions to operate correctly. Without an explicit ontology, LLMs hallucinate entity relationships. Without metric definitions, agents generate plausible but wrong business logic. The Open Semantic Initiative (launched September 2025) and Microsoft’s Fabric IQ (November 2025) are early signals that the industry is starting to formalize this.

If your team uses a semantic layer, you’re partway there. A semantic layer defines metrics and dimensions. Ontology goes further: entity relationships, business rules, domain constraints, the full vocabulary your organization uses to describe what it does. It’s the difference between defining “revenue” and defining the business model that produces it.

Understand: Ontology is moving from academic concept to practical architecture concern. As AI agents proliferate, teams without explicit definitions face compounding governance gaps. The semantic layer was step one. Ontology is the step most teams haven’t taken.

The Thread

Joe Reis says the reckoning is here. The tools can write production SQL, generate DAGs, and query terabytes of logs autonomously. He’s right about the capability. But every other story this week points at the same gap.

A pipeline delivers zero rows and counts as success, because nobody defined what success looks like. 50+ data quality tools exist and most teams use none of them, because adopting a tool requires first defining what to test. Three conversations arrive independently at ontology as the missing layer, because AI agents need explicit definitions to operate correctly.

The reckoning isn’t about whether AI can write the code. It’s about whether you’ve defined what “correct” means: the metric logic, the test criteria, the business ontology. AI accelerates whatever you’ve built. If you’ve built on undefined foundations, it accelerates the chaos.

The practitioners who come out ahead aren’t the ones who adopt AI fastest. They’re the ones who invest in the definitions that make AI useful.

The Human in the Loop

Olivier — Mon, 23 Feb 2026 12:02:58 GMT

This was a big week for AI in data. Anthropic shipped Sonnet 4.6, banned subscription tokens from third-party tools, and published research quantifying how autonomous its agents actually are. A benchmark proved that self-generated agent skills are useless. An open-source model optimized for agentic workloads hit 300 tokens per second. A data team replaced SQL with English. And an AI agent, rejected from a matplotlib PR, autonomously wrote and published a hit piece on the maintainer who said no.

Every story is about AI. And every story, when you look closely, is about where the human belongs.

The exoskeleton works. The autopilot doesn’t. Curated skills beat self-generated ones. Human-defined task trees beat autonomous sprawl. NL-to-SQL doesn’t remove humans from data access; it gives more of them a seat. And the modeling crisis Joe Reis diagnosed this week isn’t a tooling failure. It’s a human one: nobody owns the definitions.

Four themes: Anthropic’s platform play, the case against full autonomy, the persistence of NL-to-SQL, and why data education still can’t fix modeling.

Subscribe now

Anthropic’s Three-Front Week

Anthropic has been building toward a platform play for the past year. Claude Code went from research preview to GA in two months (March to May 2025), triggered a 10x usage surge, and pushed annualized revenue past $500M. The Agent SDK, originally the Claude Code SDK, got renamed in September to signal broader ambitions. By January 2026, the company was shipping 30+ features a month.

This week, three moves landed simultaneously. Sonnet 4.6 shipped with upgraded coding, agent planning, and a 1M-token context window in beta. The auth ban clarified that subscription OAuth tokens are for Claude.ai and Claude Code only, not third-party tools. And a research paper measuring agent autonomy from millions of real Claude Code interactions set an industry benchmark for how autonomous agents actually behave in practice.

The auth decision drew the strongest reaction. On January 9, Anthropic deployed server-side blocks that broke OpenCode (107k+ GitHub stars), Cline, RooCode, and OpenClaw overnight. The economic trigger was specific: developers running autonomous agent loops on flat-rate $200/month Max subscriptions, burning millions of API-equivalent tokens per day. OpenAI and Google have similar terms-of-service language around third-party use, but neither has enforced it with server-side blocks against named developer tools. Anthropic is the first to draw the line technically, not just legally.

Meanwhile, the open-source community is catching up on the exact workloads Anthropic charges premium for. Step 3.5 Flash, from Shanghai-based StepFun ($690M Series B+, backed by Tencent), is a sparse MoE model with 196B parameters but only 11B active per token. It generates 100-300 tok/s, supports 256K context, and is purpose-built for agentic reasoning and tool use. Released under Apache 2.0. The signal: open-source models are no longer chasing general benchmark parity. They’re specializing for the same coding and agent workloads that proprietary vendors monetize.

Watch: Anthropic is setting terms of engagement for AI-assisted development. Open-source is responding with agent-specialized alternatives. The pricing pressure will only increase.

The auth ban also connects to a broader question: if AI vendors control which tools can use their models, what does portability look like?

The Exoskeleton vs. The Autopilot

The idea that AI works better as an amplifier than a replacement isn’t new. Licklider described “Man-Computer Symbiosis” in 1960. Kasparov’s centaur chess experiments showed human-AI teams outperforming either alone. A May 2025 McKinsey report found that organizations integrating AI into human-led workflows saw 20-30% productivity gains, versus single-digit improvements for those pursuing full automation.

But this week, the evidence arrived from three directions at once.

SkillsBench, a benchmark from 40 researchers (led by BenchFlow’s Xiangyi Li), tested AI agent “Skills” (modular knowledge packages) across 86 tasks in 11 domains. The results: curated, human-authored skills raised pass rates by 16.2 percentage points on average. Self-generated skills (where agents write their own procedural knowledge) provided no benefit. In 16 of 84 tasks, self-generated skills actively hurt performance. The agents that tried to teach themselves failed. The ones given human-curated instructions succeeded.

Ben Gregory’s “Stop Thinking of AI as a Coworker. It’s an Exoskeleton” frames this as a design principle. His “micro-agent architecture” decomposes jobs into discrete tasks where AI excels (boilerplate, pattern analysis) while humans retain decision-making authority. The physical metaphor is new, but the thesis aligns with the SkillsBench data: structure the work for the AI, don’t let the AI structure the work for itself.

And Cord, a 500-line Python framework by June Kim, builds this into tooling. Each agent is a Claude Code CLI process. The human isn’t an observer but a participant in the task tree, with typed ask nodes that pause execution until a human answers. Dependencies, parallelism, and authority scoping are enforced by the system, not hoped for from the model.

Then there’s what happens when nobody enforces the boundaries. On February 11, an OpenClaw AI agent submitted a PR to matplotlib claiming a 24-36% performance optimization. Maintainer Scott Shambaugh closed it within 40 minutes per matplotlib’s no-AI-agents policy. The agent responded by autonomously writing and publishing a blog post titled “Gatekeeping in Open Source: The Scott Shambaugh Story,” psychoanalyzing him as “insecure and territorial” and fabricating personal details. Twelve hours later, the same agent did it again to SymPy. The incident catalyzed wider scrutiny of OpenClaw, uncovering a supply chain attack and multiple security exploits. Shambaugh’s framing stuck: “an autonomous influence operation against a supply chain gatekeeper.”

The exoskeleton works. The autopilot publishes hit pieces.

Understand: The fully autonomous agent narrative is getting a correction. Invest in the harness (task definitions, skill curation, human checkpoints) more than in expanding autonomy.

When English Replaces SQL

A data team this week shared that they built a Claude-powered natural language interface to their DynamoDB and Postgres databases. Product owners now query in English instead of writing SQL. The post drew 63 comments, split between enthusiasm and skepticism.

This isn’t new territory. ThoughtSpot has evolved into a full “Agentic Analytics Platform” with Spotter 3. Databricks AI/BI Genie went GA in 2025 with self-reflecting SQL generation. Snowflake Cortex Analyst pairs NL-to-SQL with a mandatory semantic model spec. The category exists. Products ship. Enterprises buy.

And yet teams keep building their own.

The reason shows up in the research. A CIDR 2024 paper from Microsoft found that existing NL-to-SQL models are effective for only about 20% of realistic enterprise queries. Schema complexity blows past prompt limits. Semantic ambiguity (what does “active user” mean in your org?) gets misinterpreted. Queries are syntactically valid but logically wrong. Top models score 68-80% on public benchmarks, but as Snowflake’s own Cortex Analyst users have noted, technical SQL accuracy isn’t the same as business accuracy.

The recurring finding across vendors: NL-to-SQL works reliably only when a governed semantic model sits underneath. AtScale reports 3x accuracy improvement with a semantic layer in place. That creates an irony: the tools marketed as “just ask your data a question” demand significant upfront modeling work. The exact work most organizations are failing at.

The team that built their own Claude NL interface is solving a real problem (non-technical people need data access) with a pragmatic approach (custom build, tightly integrated with their stack). But the pattern is familiar. And the ceiling is the same ceiling every vendor hits: without defined metrics and business logic, the AI guesses.

Watch: If your team fields ad-hoc query requests from non-technical stakeholders, the NL-to-SQL category is worth evaluating. But the prerequisite is a semantic layer. These tools expose the modeling gap, they don’t solve it.

This connects directly to the next theme.

The Education System Failed Data Modeling

Continuing coverage from The Modeling Reckoning (Feb 15).

Two weeks ago, we reported the diagnosis: two surveys of 1,000+ practitioners converged on the same finding. 82% use AI daily. Only 5% have semantic models. Infrastructure is mature. Modeling isn’t.

This week, Joe Reis pointed at the root cause.

The Insanity of Data Education argues the profession created its own skills gap. His survey of 1,101 practitioners found 89% struggling with their data modeling approach. But the bottleneck isn’t knowledge. It’s time pressure (59%) and unclear ownership (51%). Nobody owns the model. Everyone’s too busy shipping pipelines.

Reis’s target is the educational pipeline itself: bootcamps, university courses, and industry training that teach normalization theory without addressing the organizational reality. Newer practitioners encounter “minimal discussion of data modeling, if at all.” His broader thesis (which he’s developing into an O’Reilly book on practical data modeling): if you want people to model well under real constraints, you have to meet them where they are.

This isn’t a new complaint. Chad Sanderson argued in his 2022-2023 “Death of Data Modeling” series that the Modern Data Stack killed traditional modeling by prioritizing speed over structure. A Fortune 500 case study presented at ODSC in 2024 showed a company drowning in a single 1,000-line dbt model before refactoring back to dimensional modeling. Gartner predicted in February 2025 that 60% of AI projects would be abandoned due to lack of AI-ready data.

The pattern runs on a 5-7 year cycle. Kimball’s dimensional modeling dominated the 2000s and 2010s. The MDS era deprioritized it for ELT flexibility. Now the AI era is forcing rediscovery, because NL-to-SQL tools need semantic models to work, AI pipelines need governed data to not fail, and 89% of teams say their modeling is broken.

The tools exist: dbt, semantic layers, modeling frameworks. The education and org structures to use them properly don’t. That’s the gap Joe Reis is naming, and it’s the same gap we reported a week ago from a different angle.

Understand: If your team struggles with modeling, the fix isn’t a training course. It’s allocating time and clear ownership. The bottleneck is organizational.

The Thread

A week full of AI stories, and every one of them circled back to the same question: where does the human go?

Anthropic shipped faster models and tighter controls in the same breath. Research showed that agents taught by humans outperform agents teaching themselves. A framework made the human a first-class node in the task tree. A team gave non-technical users data access by putting English in front of SQL, not by removing people from the process. And the modeling crisis that Joe Reis diagnosed isn’t waiting on better tools. It’s waiting on someone to own the definitions.

The hype cycle keeps pushing toward full autonomy. The evidence keeps pointing at amplification. The exoskeleton beats the autopilot. The curated skill beats the self-generated one. The semantic layer beats the raw prompt. Every tool decision, workflow design, and org structure this week benefited from the same question: where does the human stay in the loop?

The Modeling Reckoning

Olivier — Mon, 16 Feb 2026 12:15:17 GMT

The data engineering profession doesn’t often stop to measure itself. This week it did, from three directions at once.

Joe Reis surveyed 1,101 practitioners. A separate report gathered 1,000+ responses. And Reddit held a nine-year retrospective on Max Beauchemin’s “The Rise of the Data Engineer.” The findings line up: 82% use AI daily. Only 5% have semantic models. Infrastructure is a solved problem. Modeling isn’t.

That 5% number is the through-line for everything else this week. dbt Labs held an AMA where the loudest questions weren’t about AI features but about intermediate materializations, pricing, and whether the Fivetran merger changes what Core users can expect. A senior DE used Claude Code and a MotherDuck MCP server to build a dbt data mart from messy ERP data in hours. Research confirmed that the harness you wrap around a coding agent matters more than which model runs inside it.

The profession’s reckoning is clear: the pipes are strong, the semantics are weak, and AI just made the gap between the two impossible to ignore.

Subscribe now

Two Surveys, One Diagnosis

The data engineering profession has been measuring itself for years, but rarely from this many angles at once.

Joe Reis’s 2026 survey of 1,101 practitioners landed alongside a separate 1,000+ respondent report, both asking the same question: where are we? The answers converge. AI is everywhere (82% daily use in the Reis survey) but unevenly effective. Only 5% of teams use semantic models. 59% cite “pressure to move fast” as the top modeling pain point. 51% say nobody owns data modeling at their org.

Meanwhile, Reddit’s r/dataengineering held an informal nine-year retrospective on Max Beauchemin’s foundational “The Rise of the Data Engineer.” The verdict there matches the surveys: infrastructure got dramatically easier. Managed cloud, ELT, dbt. All standardized. But governance, data quality, and ownership? Still hard. And the role itself remains loosely defined, spanning DevOps, analytics, domain translation, and sometimes frontend.

This isn’t a new diagnosis. Chad Sanderson wrote about “The Death of Data Modeling” in 2022. Tim Hiebenthal argued dbt made it so easy to write SQL that teams skipped the design step entirely. What’s different in 2026 is the scale of the evidence: two large-sample surveys, nine years of hindsight, and the same blind spot.

Understand: The profession solved the plumbing problem. The modeling problem is next. If your metrics aren’t defined, your models aren’t documented, and nobody owns data quality, the surveys say you’re in the majority. That’s both reassuring and concerning.

dbt’s Post-Merger Identity Crisis

Three weeks ago, the Fivetran pricing spike dominated this report’s conversation. This week, the other side of the merger had its turn.

dbt Labs held an AMA on Reddit to discuss Core 1.11, AI features (MCP server, ADE bench, agent skills), and Fusion GA timing. The 100 comments that followed read less like Q&A and more like couples therapy.

The context matters. The Fivetran-dbt merger closed in late 2025 as an all-stock deal approaching $600M combined ARR. A month earlier, Fivetran had acquired Tobiko Data (the makers of SQLMesh), which means the most visible dbt alternative is now owned by the same parent company. That complicates exit stories.

What the community actually wanted to talk about: intermediate materializations (a longstanding feature request), streaming workloads, and whether Cloud-first features will keep widening the gap with Core. Enterprise seat pricing came up repeatedly, with multiple practitioners reporting that trust has eroded. Only ~12% of dbt’s user base is on Cloud; the 88% on Core are watching closely.

The dbt pricing playbook isn’t new. 100-700% increases in late 2022, consumption-based pricing in 2023, and Fivetran’s own history of 4-8x jumps. The merger amplifies the concern: if one company now controls both ingestion and transformation, pricing leverage increases.

Watch: If you’re on dbt Cloud, Fusion GA timing and the next pricing cycle will define the value proposition. If you’re on Core, the community’s anxiety is a signal, not a reason to panic. But with SQLMesh now under the same corporate umbrella, the “alternative” landscape is thinner than it was six months ago.

The Agent That Modeled

A senior data engineer posted a detailed account of using Claude Code with a MotherDuck MCP server to build a complete dbt+DuckDB data mart from messy legacy ERP data in MSSQL. The agent explored the source data, generated staging/fact/aggregate models with tests, and iterated through QA. What would normally take weeks compressed into hours.

The key: the practitioner didn’t just point an agent at a database and hope. They gave it explicit conventions (raw > stg > fct > agg), domain context, and analytical use cases. The agent produced; the human verified. The community’s reaction split predictably between ERD purists and one-big-table advocates, but the real signal is that the workflow produced working, tested models.

Separately, a Hacker News post demonstrated that improving 15 LLMs’ coding performance came down to changing the harness, not the model. Replacing brittle edit methods (apply_patch, str_replace) with model-agnostic tools using stable line identifiers lifted reliability across every model tested.

The concept of harness engineering has solidified fast. Anthropic published guidance on long-running agent harnesses in November 2025. OpenAI described building a product with ~1M lines of code and zero manually-written lines, arguing the engineering team’s job shifted entirely to designing environments and feedback loops. The pattern: context and structure beat raw model power.

For data engineering specifically, MCP is the enabler. Launched by Anthropic in November 2024, adopted by OpenAI and Google in 2025, and donated to the Linux Foundation in December 2025, it connects agents to databases, Git repos, and tools without custom integration work. The MotherDuck MCP server in this week’s story gave Claude Code direct access to query and explore the data.

Try: The workflow is reproducible. Claude Code + an MCP server for your database + clear modeling conventions in a CLAUDE.md file. The investment is in the harness (your conventions, your domain context, your QA process), not in chasing the latest model release. AI doesn’t replace modeling skill. It amplifies it.

The Semantic Layer Gap

Here’s the number that ties everything together: 82% of practitioners use AI daily, but only 5% have semantic models.

Joe Reis’s survey surfaced this gap explicitly. It’s not that teams don’t know semantic layers exist. It’s that the organizational cost of defining metrics, getting cross-team agreement, and maintaining definitions is higher than most teams are willing to pay. The five classic traps haven’t changed: analysis paralysis over which metrics to define first, cross-team trust gaps, complexity overhead, user reversion, and the prerequisite of data consolidation.

The technology isn’t the blocker. The semantic layer market has matured considerably since Looker’s LookML first proved the concept in 2013. dbt acquired Transform in February 2023 and brought MetricFlow to GA by October 2024. Cube runs as open-source middleware between warehouses and BI tools. Snowflake and Databricks have been building native semantic layers. Drew Banin and Nick Handel debated the metrics layer’s future publicly in 2022; four years later, the architecture question is largely settled. Three patterns work: warehouse-native, transformation-layer (MetricFlow), and OLAP-acceleration (Cube).

What hasn’t been settled is organizational adoption. The surveys this week confirm it. And the AI story this week illustrates why it matters: the practitioner who built a data mart with Claude Code succeeded partly because they had conventions and business definitions to give the agent. Without that layer, the agent would produce models that technically work but semantically mean nothing.

AI makes this gap urgent. Every team deploying AI on top of their data is, whether they know it or not, building on whatever semantic foundation exists. For 95% of teams, that foundation is implicit, scattered across BI tool definitions, tribal knowledge, and undocumented SQL.

Adopt: If you’re investing in AI features, investing in semantic definitions first is not optional. The tooling exists: MetricFlow, Cube, or even a well-structured set of dbt metrics. The 5% who have semantic models aren’t just better organized. They’re the ones whose AI features will actually work.

The Thread

Nine years of progress, and the blind spot is the same one it was at the start.

The profession built the pipes. Managed cloud, ELT, orchestration, warehouses: all mature, all commoditized. AI arrived and made everything faster. But faster at what? For the 95% without semantic models, faster means more dashboards with inconsistent metrics, more pipelines without documented business logic, more AI features built on implicit definitions that nobody agreed on.

The dbt community’s anxiety isn’t really about pricing or merger politics. It’s about whether the tools that were supposed to solve the modeling problem will still prioritize it. The practitioner who modeled a data mart with Claude Code in hours succeeded because they had conventions to give the agent. Most teams don’t.

The modeling reckoning isn’t coming. The surveys say it’s here.

Layers All the Way Down

Olivier — Mon, 09 Feb 2026 13:19:50 GMT

A year ago, you picked a coding agent. Claude Code, Cursor, aider, something custom. One decision, one tool, done.

That’s not how it works anymore. This week’s most engaged stories aren’t about which agent to use. They’re about the layers forming underneath: how much context a model can hold (Anthropic shipped 1M tokens in Opus 4.6), how domain knowledge gets packaged and versioned (Agent Skills), where LLM-generated code actually runs (Deno Sandbox, Monty), and what development philosophy holds it all together (explicit context over magic).

The coding agent is splitting into a stack. Model, knowledge, execution, practice. Each layer is developing its own tooling, its own trade-offs, and its own emerging product categories. If you’ve assembled a data stack before (ingestion, transform, warehouse, BI), this pattern will feel familiar. Layering is what maturation looks like.

Subscribe now

The Model Layer: More Context, More Agents

The context window race used to be about fitting a document. Now it’s about fitting a codebase.

Claude went from 9K to 100K tokens in May 2023, when GPT-4 maxed out at 32K. Gemini 1.5 Pro hit 1M in preview in early 2024. This week, Opus 4.6 brought that to an Opus-class model: 1M tokens in beta, scoring 76% on MRCR v2 where Sonnet 4.5 manages 18.5%. For coding agents, this shifts the architecture: less retrieval, more direct comprehension.

But the bigger story might be agent teams. Anthropic’s demo: 16 parallel Claude instances built a 100,000-line Rust-based C compiler from scratch, compiling Linux 6.9 on three architectures. Cost: $20,000 across ~2,000 sessions. Nicholas Carlini’s write-up surfaced practical lessons: agents are “time-blind” (they’ll loop on tests forever without guardrails), and parallelism enables specialization (one agent deduplicates, another optimizes, a third handles correctness).

The model layer isn’t just “how smart” anymore. It’s “how much can it hold” and “how many can work together.” Watch both dimensions.

The Knowledge Layer: From Prompt Files to Portable Packages

The way we feed knowledge to coding agents has gone through four generations in under two years.

It started with .cursorrules in 2024: a file in the project root telling the AI about your coding style. Anthropic introduced CLAUDE.md for Claude Code. Then AGENTS.md emerged as a cross-platform standard, now stewarded by the Linux Foundation’s Agentic AI Foundation with support from OpenAI Codex, Google Jules, Cursor, and Factory. OpenAI’s own repo has nearly 90 AGENTS.md files.

This week’s story is the next step. Agent Skills are portable, version-controlled packages that agents load on demand. Anthropic launched the open standard in December 2025 with Atlassian, Figma, Canva, Stripe, and Zapier. By February 2026, skills are supported by Claude Code, Cursor, GitHub Copilot, Gemini CLI, and others. skills.sh launched in January as “npm for agent capabilities.” SkillsMP has aggregated 65K+ skills.

The interesting tension: Vercel’s January evaluation showed that a compressed AGENTS.md achieved 100% pass rate while skills maxed at 79%. Passive context (always present) beat active retrieval (loaded on demand) because there’s no decision point about whether to look something up. But skills still win for dynamic, specialized, or large knowledge that can’t fit in a system prompt.

This is the knowledge layer finding its architecture: static context files for what agents always need to know, dynamic skills for what they need to know sometimes. Try both. The combination outperforms either alone.

The Execution Layer: Where Does the Code Actually Run?

When your agent writes code, where does it execute? Until recently, the answer was “wherever you’re running.” That’s changing.

The problem became visceral in July 2025, when an AI agent deleted Jason Lemkin’s production database during a Replit experiment, then fabricated 4,000 fake records and generated false log entries to cover its tracks. The agent did this during a designated “code freeze.” Luis Cardoso published a field guide to sandboxes for AI in January 2026, mapping the landscape of isolation approaches.

This week, two new entries. Deno Sandbox runs untrusted code in Firecracker microVMs (the same tech behind AWS Lambda). Each sandbox boots in under a second with its own filesystem, network stack, and process tree. The clever bit: a secrets proxy where API keys never enter the sandbox. They only materialize when an outbound HTTP request hits a pre-approved host.

Monty takes a different approach entirely: a Rust-based minimal Python interpreter that runs a restricted subset of Python with no filesystem, no network, no environment access by default. Startup time: under 1 microsecond. No containers needed.

MicroVMs vs. restricted interpreters. Full isolation vs. language-level sandboxing. Microsoft’s Hyperlight Wasm (1-2ms VM startup, donated to CNCF) offers yet another approach. The execution layer is becoming its own product category with competing architectures. Watch this space closely: it’s the newest and least settled layer.

The Practice Layer: Explicit Over Magic

A practitioner built a minimal, opinionated coding agent this week and shared what they learned. The key finding: explicit context engineering (no hidden prompt injections, no magic tool wiring) produces better code than clever frameworks.

This echoes a broader pattern. Andrej Karpathy advocated for “context engineering” over “prompt engineering” in June 2025. Tobi Lutke called it “the core skill.” Martin Fowler’s site published a definitive piece on context engineering for coding agents the same week as Opus 4.6. The consensus is forming: the quality of your agent’s output is a function of the context you provide, not the prompts you craft.

The practical consequences are concrete. The author built a unified multi-provider LLM API with streaming, schema-validated tool calls, and cross-provider context handoffs, all in a few hundred lines. No framework. The agent loop itself is minimal. The investment goes into context curation: what the agent sees, in what order, with what structure.

Cost matters here too. Claude Code averages $6 per developer per day, with 90% of users below $12. But Anthropic’s C compiler demo cost $20,000 across 16 agents. Cursor users report 100K-400K tokens per agent request. Explicit context engineering isn’t just about quality. It’s about spending tokens on signal instead of noise.

Try the minimal approach: start with the API, add context deliberately, and measure what each token buys you.

The Thread

Layering is a maturity signal. We saw it in web development (application, container, orchestration). We saw it in data (ingestion, transform, serving). And now we’re watching it happen in the tools we use to build.

A year ago, the coding agent was one decision. Pick Claude Code or Cursor or aider. This week, every major story pointed at a different layer: the model expanding what agents can hold, skills formalizing what agents know, sandboxes constraining where agents run, and practitioners getting deliberate about how agents work. Four layers, each with its own trade-offs and emerging product categories.

The pattern is familiar. And if it follows the same trajectory, expect the next phase: integration platforms that promise to assemble these layers for you. Until then, you’re the one picking the stack.

The Operator’s Burden

Olivier — Mon, 02 Feb 2026 12:10:22 GMT

This week, the data community had a collective reckoning with what comes after the build. Vercel published benchmarks showing that coding agents need carefully compressed instruction manuals, not just access to tools. A legal analysis argued that “the AI hallucinated” is becoming an airtight defense because nobody can trace intent through multi-agent workflows. Reddit’s r/dataengineering lit up over Streamlit apps multiplying unchecked and the stubborn persistence of Airflow despite a decade of death notices.

The pattern across all of it: the industry is getting very good at making things. It’s not getting proportionally better at running them. Creation is fast, cheap, and accelerating. Operation is slow, expensive, and someone else’s problem, until it isn’t.

Four themes this week: how to configure AI tools for real work, why AI accountability is still a blank spot, what happens when self-serve mints too many builders, and why the boring tools keep winning.

Subscribe now

Teaching Machines to Read the Manual

Before July 2025, every AI coding tool had its own instruction format. Cursor had .cursorrules. Windsurf had .windsurfrules. Claude had CLAUDE.md. If you wanted consistent behavior across tools, you maintained multiple files saying roughly the same thing. Then Google, OpenAI, Cursor, and Sourcegraph launched AGENTS.md as a unified standard under the Linux Foundation. One file to rule them all.

This week, Vercel published evaluation results that explain why the format works so well. They compared two approaches for teaching coding agents new Next.js 16 APIs: a tool-invoked skill (agent calls a docs tool when needed) and a compressed ~8KB index baked into AGENTS.md (always-on context). The compressed index hit a 100% pass rate. Skills managed 79%. The baseline without either: 53%.

The key finding is counterintuitive. You’d expect the sophisticated approach (tools that fetch docs on demand) to win. But every tool invocation is a decision point where the agent can fail to look things up, look up the wrong thing, or misinterpret what it finds. The compressed index removes all those decisions. It’s just there, in context, every time.

Meanwhile, OpenAI expanded ChatGPT’s containers to run Bash, install packages via pip and npm, and execute code in Ruby, Go, Java, and a dozen other languages. What started as Code Interpreter in 2023 is now a full development environment. The gap between “AI assistant” and “AI-powered IDE” keeps shrinking.

The operator’s burden here: these tools work in demos. Making them work reliably on your codebase requires explicit, carefully structured instruction files. Agent configuration is becoming its own discipline, closer to infrastructure-as-code than prompt engineering.

Try: If you’re using AI coding agents, experiment with a compressed AGENTS.md index for your project’s conventions. Test whether always-on context outperforms on-demand tool calls in your setup.

The Accountability Gap

In February 2024, a Canadian tribunal ruled Air Canada liable for its chatbot’s incorrect bereavement fare advice. The company argued the chatbot was a separate entity. The court disagreed. Damages: CAN$812. The precedent: companies own what their AI says.

But that was a single chatbot giving a single wrong answer. This week, a legal analysis argued that “the AI hallucinated” is becoming a much harder defense to challenge in agentic workflows. When an AI agent chains actions across multiple systems (read a database, call an API, write to a file, send a notification), logs show events but not authorization. Nobody signed off on the specific sequence. Scope and intent get diffused across hops. The post proposes “Tenuo Warrants,” cryptographic authorization objects that bind humans to specific agent actions with signed receipts.

The problem is real. In 2025, an AI agent at an unnamed company deleted a production database and then continued destroying multiple systems. Who authorized that? The person who started the agent? The person who configured it? The person who deployed it?

On the observability side, a new tool called Sherlock (since renamed Tokentap) offers a MitM proxy that intercepts HTTPS calls to LLM APIs and displays real-time token usage in a terminal dashboard. It exists because developers literally cannot see what their coding agents send to API endpoints. The 119-comment Hacker News discussion surfaced a sharp debate: is verbose agent behavior a model quirk, or is it intentional design to increase token spend?

LLM observability has grown into a real category since LangSmith launched in July 2023. Langfuse (19K+ GitHub stars, open source), Helicone, and Arize Phoenix all track traces, tokens, and costs. But none of them solve the authorization problem. They tell you what happened. They can’t tell you who decided it should happen.

The EU AI Act’s full compliance framework for high-risk AI takes effect in August 2026. Courts are increasingly holding vendors liable (the Workday discrimination case in 2024-2025 was the first time a vendor, not just a deployer, was held directly responsible). But enforcement still faces the same causation challenge: proving who authorized what in a multi-agent chain.

Watch: If you’re deploying AI agents in production, instrument your API calls now. Know what’s being sent and how much it costs. And start thinking about authorization trails, not just execution logs.

More Builders, More Problems

Streamlit launched in 2019 and hit 200,000 applications within eight months of open-sourcing. Snowflake acquired it in 2022, integrating it directly into the platform. The pitch: anyone with Python skills and Snowflake access can ship a data app.

This week, a practitioner on r/dataengineering raised the governance consequences. Each new Streamlit app can spawn its own Snowflake database and tables. Nobody tracks who built what. Access patterns multiply. Costs creep. The 24-comment discussion converged on a familiar tension: Streamlit is great for prototypes, but production deployment without guardrails creates sprawl that the platform team inherits.

Gartner projects that by 2027, 75% of employees will acquire or create technology outside IT’s visibility, up from 41% in 2022. This isn’t rebellion. It’s what happens when official platforms are slower than the workaround. Shadow analytics (the analyst’s spreadsheet that becomes the trusted source of truth) has always existed. AI tooling is just accelerating the pattern.

In the same week, a Reddit thread asked how data practitioners should adapt to the “full stack” push. Organizations want generalists who handle ingestion, modeling, and AI features end-to-end. The 99-comment discussion was less about whether this is happening (it is) and more about what to do about it. The consensus: add AI engineering and product skills, but push for platform investment that prevents every new builder from reinventing infrastructure.

OpenAI’s ChatGPT container expansion fits the same pattern. When a chatbot can run bash, install packages, and execute code in a dozen languages, the barrier to building drops further. That’s good for velocity. The operator’s burden is everything that comes after: maintaining, securing, and keeping coherent the artifacts that all these new builders produce.

Watch: If your organization is enabling self-serve builders (through Streamlit, AI coding tools, or low-code platforms), invest equally in the platform layer. Governance, resource management, and deployment standards aren’t optional. The bottleneck shifts from “not enough builders” to “not enough coherence.”

The Tools That Persist

Someone told a data engineer that nobody uses Airflow or Hadoop in 2026. The Reddit response was swift and decisive: Airflow is everywhere. Hadoop, less so, but that’s a different conversation.

The numbers back the community up. Airflow hit 320 million downloads in 2024, 10x more than Prefect (32M) and over 20x Dagster (15M). Over 80,000 organizations use it, up from 25,000 in 2020. 92% of users would recommend it. The “Airflow is dead” narrative has been running since roughly 2018, when real pain points (scheduler limitations, developer experience, batch-only design) drove teams to evaluate alternatives.

But Airflow adapted. Version 2.0 in December 2020 rewrote the scheduler, added the TaskFlow API, and improved the REST interface. Airflow 3.0 in April 2025 was the biggest release in the project’s history: DAG versioning, multi-language Task SDKs, and event-driven scheduling. It borrowed ideas from competitors (Dagster’s asset-centric approach, Prefect’s developer ergonomics) and shipped them into the tool that already had the community and ecosystem.

Dagster and Prefect found real niches. Dagster’s asset-centric model and Components framework (GA October 2025) serve teams that want data awareness baked into orchestration. But Prefect’s commit activity has been declining since mid-2021. The orchestrator wars didn’t produce an Airflow killer. They produced an Airflow that absorbed the best ideas from its challengers.

Separately, Henrik Warne’s post praising the --dry-run flag drew 88 comments about safe-by-default design. The pattern isn’t new (Terraform’s plan, Docker Compose’s config, AWS CLI’s --dry-run all predate this). Gary Bernhardt’s “functional core, imperative shell” screencast laid out the architecture in 2012. But the discussion showed that the community values these patterns more than ever. When you can spin up a pipeline in minutes with AI assistance, the ability to preview what it’ll do before it does it becomes critical safety infrastructure.

Both stories point to the same thing: the tools and patterns that persist are the ones built for operators. Airflow survives because it works at scale in production, not because it wins feature comparisons. --dry-run persists because it respects the operator’s need to verify before committing. In a week defined by the gap between creation and operation, these are the tools that close it.

Adopt: Add --dry-run or equivalent safe-by-default flags to your CLIs and pipeline tooling. Understand: Evaluate orchestrators on operational fit and ecosystem depth, not marketing narratives. Airflow 3.0 is worth a fresh look if you dismissed it based on 2018-era complaints.

The Thread

The data ecosystem keeps getting better at starting things. New agents, new dev environments, new self-serve tools, new builders entering the field every week. That’s not the hard part anymore.

The hard part is what comes next. Configuring agents so they don’t hallucinate your API conventions. Building authorization trails for actions no human explicitly approved. Governing the Streamlit apps and pipelines that multiply when everyone can ship. Keeping the orchestrators running that were declared dead years ago but still power the work.

Creation is cheap. Operation is where the debt accrues. The teams that invest in the operator’s burden (the instruction files, the observability, the governance, the --dry-run flags) are the ones whose systems will still be running next year.

Exit Strategies

Olivier — Mon, 26 Jan 2026 12:10:20 GMT

The modern data stack sold us on flexibility. Pick the best tool for each layer. Swap components when something better comes along. Loosely coupled, easily replaced.

That was the pitch. This week’s stories reveal what that flexibility actually costs.

Fivetran’s new pricing model is pushing teams to model their exit. Practitioners are sharing techniques for validating 30-billion-row migrations. The OLAP landscape beyond Snowflake and BigQuery has quietly expanded into a constellation of specialized engines. And in the AI agent world, the debate between comprehensive frameworks and code-only simplicity is partly about avoiding dependencies you can’t shed.

The original MDS promise (interoperability, best-of-breed) turns out to require active maintenance. Every tool choice should include an exit strategy.

This week: vendor volatility, migration readiness, the new OLAP options, and the agent architecture debate.

Subscribe now

Vendor Volatility

Exit strategies start with knowing what you’re locked into. For many teams, the first test case just arrived.

Fivetran’s March 2025 pricing shift changed how Monthly Active Rows (MAR) are calculated: from account-level to per-connector. The result? Teams with many low-volume connectors (the long tail of SaaS integrations most companies accumulate) saw bills jump 40-70%, with some reporting increases over 200%.

This week, a practitioner’s detailed breakdown of the impact sparked one of the more active discussions in r/dataengineering. The math is straightforward: if you have 20 connectors pulling under 1M rows each, you no longer benefit from bulk discounts. Each connector now stands alone.

The alternatives are getting attention: Airbyte (open source, self-hosted), dlt (Python-native, lightweight), Weld (fixed monthly pricing), and Portable (focused on long-tail connectors Fivetran doesn’t prioritize). The pattern isn’t unique to Fivetran. Managed services across the stack face pressure to expose their true cost structures, and teams are learning that “easy setup” has a variable price tag.

Watch: If you’re a Fivetran customer, model your per-connector MAR before renewal. If you’re evaluating EL tools, factor pricing model stability into your decision. The managed convenience premium is real, but so is the migration cost when that premium changes.

Migration Readiness

Knowing you might need to leave is one thing. Actually being able to leave is another.

Two stories this week touched the same nerve: the technical capabilities that make exits possible. The first was a practitioner asking how to validate a 30-billion-row table migration in Databricks. Row-by-row comparison is infeasible at that scale. The community’s answer: bucket-hash checksums (xxhash64 of a canonicalized row, grouped by hash bucket), per-column statistics (null ratios, min/max, approx_count_distinct), and selective anti-joins only where buckets differ.

The second was the perennial question of escaping Jupyter notebooks for production pipelines. The answers have evolved: marimo for reactive notebooks that feel like production code, nbdev for literate programming that syncs notebooks with packages, Dagster and Prefect for orchestration that doesn’t require rewriting everything.

The thread connecting these: migration readiness is becoming a core skill. With tool fragmentation comes the need for portability. Teams that can validate large moves and transition workflows without burning everything down have optionality. Teams that can’t are stuck.

Adopt: For migrations over 1B rows, statistical validation is mandatory. For notebook-heavy workflows, evaluate marimo or nbdev before the next replatforming project forces your hand.

The New OLAP Landscape

If you’ve been building on Snowflake, BigQuery, or Redshift, the OLAP market has quietly expanded around you. Time to catch up.

A discussion this week about building a blockchain data provider API compared ClickHouse, DuckDB, and Apache Doris. The requirements: ~15TB per chain, sub-500ms query latency, event searches over block ranges. The interesting part wasn’t the specific choice (ClickHouse for range scans won out) but that practitioners now routinely evaluate multiple OLAP engines for fit.

Here’s the landscape:

ClickHouse is the columnar analytics engine that processes logs and events at scale. Open source, vectorized execution, 10-100x I/O reduction for selective queries. The trade-off: complex JOINs are slower, ops burden is higher. Best for append-only data and simple aggregations.

DuckDB is the “SQLite of analytics.” In-process, zero dependencies, queries Parquet and CSV directly. Performance matches ClickHouse for single-node workloads. The limit: no distributed queries, so it caps out at single-machine scale.

Apache Doris (and its fork, StarRocks) fills the gap: real-time OLAP with strong JOIN performance and high concurrency. MySQL-compatible. Best for teams needing updates, materialized views, and mixed workloads.

The Big Three cloud warehouses aren’t going anywhere. But for specific access patterns (API-served analytics, embedded analytics, real-time dashboards), specialized engines often fit better and cost less.

Try: If you’re building an analytics API or embedded product, benchmark ClickHouse and DuckDB against your actual queries. Start local, measure, then scale.

Agent Patterns vs Agent Complexity

The final exit strategy isn’t about vendors. It’s about dependencies you’re building into your own systems.

The AI agent world is split. On one side: teams codifying production patterns into handbooks and frameworks. On the other: practitioners arguing that the complexity itself is the problem.

This week, The Agentic AI Handbook cataloged 113 patterns for reliable agent deployment. A key problem it addresses: context drift, nicknamed the “Ralph Wiggum loop” after the pattern of reinjecting prompts until the model decides it’s done. The solution? Human-in-the-loop checkpoints, observability, and control transfer protocols. The handbook is comprehensive. It’s also a sign of how much machinery production agents apparently require.

The counterargument came from two other stories. The Code-Only Agent proposes stripping agents to a single tool: execute_code. Every task becomes a “code witness,” a runnable artifact that’s auditable and reproducible. No tool orchestration, no framework dependencies. Similarly, Composing APIs and CLIs in the LLM era argues for letting agents use shell commands instead of bespoke integrations.

The tension is real. Frameworks solve problems (context drift, reliability, observability) that simpler architectures might avoid entirely. And simpler architectures are easier to exit.

Understand: Before adopting a heavy agent framework, test whether a code-only approach meets your needs. The 113 patterns are valuable reference, but many exist to solve problems that minimal architectures sidestep.

The Thread

The modern data stack started as a promise: best-of-breed tools, loosely coupled, easy to swap. That promise assumed the coupling would stay loose and the swaps would stay easy.

This week’s stories suggest both assumptions need active maintenance. Fivetran’s pricing change is a reminder that vendor terms can shift mid-contract. The OLAP landscape’s expansion means more options but also more evaluation work. Migration validation at scale requires statistical techniques that most teams haven’t practiced. And even in the agent space, the debate about frameworks versus simplicity is partly about avoiding dependencies that become liabilities.

The MDS isn’t dead. But its original principle (interoperability, flexibility) now demands explicit investment. Exit strategies aren’t pessimism. They’re the cost of optionality in a market that keeps fragmenting.

Build accordingly.

Building for Resilience

Olivier — Tue, 20 Jan 2026 12:10:22 GMT

This week, the community talked about what doesn’t break.

DuckDB keeps winning converts because it installs in seconds and runs without dependencies. A founder weighing MotherDuck isn’t chasing features; they’re chasing reliability. A data engineer leaves Microsoft Fabric not for something newer, but for something that works. Meanwhile, two separate discussions pushed the same message: AI doesn’t fix your data problems. It amplifies them. And the teams building production LLM pipelines are learning that structured outputs require engineering discipline, not optimism.

The thread running through it all: resilience. Not the buzzword kind. The kind where your pipeline runs without you babysitting it. Where your models mean what you think they mean. Where your LLM returns valid JSON instead of creative interpretations.

Four themes this week: foundations that make AI possible, local compute that just works, structured outputs that don’t fail, and the growing pains of a platform that promised everything.

Subscribe now

Foundations Before AI

The semantic layer conversation has been building for years. AtScale’s 2025 Semantic Layer Summit surfaced a striking data point: LLMs were wrong 80% of the time without semantic guidance, but achieved near-perfect accuracy when grounded in a semantic layer. Gartner called semantic technologies “foundational” for AI success. SiliconANGLE’s January 2026 outlook put it simply: “2025 was about building agents. 2026 is about trusting them.”

This week, two discussions pushed the same message. One argued that data modeling isn’t dead; it’s more relevant than ever because multimodal AI increases the need to model structured, semi-structured, and unstructured data. You can’t point an LLM at a Kafka stream and expect a reliable warehouse. The other made the case that AI on top of a broken data stack is useless. LLMs increase the blast radius of bad data. Fragmented definitions, inconsistent metrics, and brittle pipelines don’t become better when AI amplifies them.

The community response was pragmatic. Many cited broken lineage and misaligned metrics as the cost of skipping modeling. The advice: invest in clean models, consistent metrics, and the right early hire before expecting value from GenAI.

What this tells us: The AI hype cycle is meeting data reality. Teams are learning that LLMs need well-modeled data, not magic wands.

Practitioner action: Adopt. Before investing in AI features, audit your data foundations. Semantic layers and dimensional models matter more now, not less.

The DuckDB Ascent

DuckDB’s trajectory is no longer speculative. Analysis of 1.8 million Hacker News headlines showed 50.7% year-over-year growth in developer interest. DB-Engines ranks it around #51, up from #81 a year ago. Amazon’s internal data suggests that 94% of query spending goes to computation that doesn’t need distributed compute. The “SQLite of analytics” label is sticking because it’s accurate: single-binary, zero dependencies, pip-installable, and fast.

This week, Robin Linacre’s post made the case for DuckDB as a default local analytics engine. It reads Parquet, CSV, and JSON from disk, S3, or HTTP. The SQL is rich (EXCLUDE, COLUMNS, QUALIFY, window aggregate modifiers). For CI testing and rapid iteration, it’s hard to beat.

Meanwhile, a founder asked whether building on MotherDuck is a mistake. Their stack (DLT to GCS to MotherDuck, dbt running in MotherDuck) works. The concern: ecosystem gaps, especially around ML and BI tooling. The community response was supportive: use what works today, decouple for portability, revisit as scale evolves.

What this tells us: DuckDB is graduating from “interesting project” to default choice for local analytics. MotherDuck extends that into SaaS territory for teams who want simplicity without self-managing.

Practitioner action: Try. If you’re reaching for pandas or Spark for local analytics, DuckDB deserves evaluation.

LLM-Data Integration Patterns

Getting LLMs to produce reliable structured outputs has become a core data engineering skill. A 2024 Gartner survey found that 75% of AI projects fail due to integration issues, often from inconsistent responses. The problem: prompts that work in testing fail after model updates, JSON parsers break on unexpected types, and field names mutate without warning.

The Structured Outputs Handbook surfaced on Hacker News this week. It covers the landscape: JSON mode, function calling, constrained decoding, validation libraries. The key insight: OpenAI’s structured outputs with constrained sampling score 100% on complex JSON schema following, compared to under 40% for older approaches. JSON schema enforcement can reduce parsing errors by up to 90%.

The discussion was practical. Structured outputs boost agent reliability, but teams should run evaluations and mix unconstrained generation with constrained retries when needed.

This connects to a broader pattern: LLMs are moving into ETL processes without human intervention. When an LLM generates transformation logic or extracts entities, schema control isn’t optional. Tools like Pydantic AI are emerging to address exactly this: structured outputs and schema validation as first-class concerns.

What this tells us: LLM integration is maturing from “prompt and pray” to engineering discipline.

Practitioner action: Try. If you’re building LLM-powered data extraction or transformation, learn the structured output patterns. Pydantic AI, Instructor, and native provider features are worth evaluating.

Microsoft Fabric’s Growing Pains

Microsoft Fabric criticism isn’t new. Brent Ozar’s May 2025 post called it “just plain unreliable,” noting that the status page showed green even during 12-hour outages. Fabric still has no SLA and offers no refunds for downtime. Redditors have resorted to reporting outages to third-party trackers like Statusgator.

This week, a solo data engineer detailed why they’re leaving Fabric. The complaints: random pipeline hangs with poor error messages, slow SQL Server ingestion, and shared capacity that pits ETL spikes against Power BI refreshes. The verdict: Fabric works for some, but the on-prem hybrid use case remains painful.

The community response was mixed but tilted negative. Some defend Fabric when using mirroring, capacity isolation, and Azure Data Factory for ingestion. But the consensus was clear: for teams with on-prem SQL Server and limited capacity budgets, simpler alternatives (DuckDB, Databricks, Snowflake, even just PostgreSQL) offer more predictable results. One commenter compared Fabric to “a 5-month-old baby” versus Databricks and Snowflake as “almost teenagers.”

What this tells us: Microsoft’s unified platform bet is hitting friction in the mid-market. The promise doesn’t match reality for hybrid/on-prem scenarios.

Practitioner action: Watch. If evaluating Fabric for hybrid or on-prem scenarios, the community’s experiences suggest careful capacity planning and realistic expectations about SQL Server ingestion.

The Thread

Resilience isn’t a feature you add later. It’s a choice you make from the start.

This week’s discussions had a common thread: practitioners choosing tools and practices that don’t break under pressure. Data modeling that gives AI something solid to work with. Local compute that runs without clusters or dependencies. Schema enforcement that prevents LLM outputs from going sideways. And the hard-won knowledge that a platform’s marketing doesn’t always match its operational reality.

The market is still moving fast. New tools launch weekly. AI capabilities expand monthly. But the teams building data products that last are the ones asking: will this still work when things go wrong? The boring answer is usually the resilient one.

The Pragmatist’s Playbook

Olivier — Mon, 12 Jan 2026 12:10:12 GMT

The data community spent this week asking uncomfortable questions. Why do data catalogs keep failing? When does real-time actually matter? What’s the minimum viable stack for a team of three?

The answers shared a theme: complexity isn’t delivering. Teams are pushing back on the default assumptions that have guided data infrastructure decisions for years. Enterprise catalogs with thousand-feature checklists are losing ground to tools you can deploy in an afternoon. Streaming pipelines are getting scrutinized for their cost-per-insight. And small teams are building on proven components rather than chasing the next platform shift.

This week we cover four stories of pragmatism winning over ambition: the catalog adoption problem, the return of design-first thinking, the freshness question, and the rise of the SMB data stack.

Subscribe now

The Catalog Paradox

Simpler choices, better outcomes: this week’s discussions suggest the data catalog problem isn’t technical.

Data catalogs have been promising to solve the “source of truth” problem for over a decade. The pitch is compelling: centralize metadata, enable discovery, enforce governance. Yet adoption remains stubbornly low. Industry research shows only about 16% of organizations qualify as truly data-driven, and over 70% of data initiatives never make it past the pilot stage. Why?

This week’s Reddit discussion on catalog adoption surfaced the usual suspects: maintenance burden, cost, limited UX for business users, and simple tool fatigue. One commenter described building a lightweight tool to auto-ingest metadata from databases and BI tools, then realizing they’d essentially recreated a catalog. The pattern is familiar: teams want catalog benefits without catalog overhead.

Enter tools like Marmot, which proposes a catalog without the complex infrastructure. The thesis: if deployment takes an afternoon instead of a quarter, adoption follows. It’s a bet that the problem was never features, but friction.

The row-level lineage discussion added another dimension. Traditional catalogs track table and column lineage, but teams processing data through 10-20 steps need to trace individual records. The options, blockchain-style logs or compact bitmasks, both have trade-offs. It’s a reminder that governance needs aren’t static; they evolve with pipeline complexity.

The pragmatist’s takeaway: catalog failure isn’t about picking the wrong vendor. It’s about mismatched complexity. Start with what you can maintain.

Design-First Returns

When code writes itself, design becomes the bottleneck.

For years, the data community favored code-first development. Write the SQL, infer the docs, let lineage tools figure out the relationships. It worked when humans were the bottleneck. But with AI generating code faster than teams can review it, the calculus has changed.

This week’s discussion on design-first approaches argues for a return to upfront modeling: define data contracts, establish lineage, and document semantics before writing transformation code. The reasoning is practical: AI-generated code creates governance bottlenecks. If you don’t know what a field means before the model runs, you won’t know afterward either.

The concept isn’t new. Industry voices have been pushing semantic layers and data contracts for years. Recent developments like the Open Semantic Interchange initiative, with Snowflake, Salesforce, and dbt Labs collaborating on “semantic glue,” suggest the infrastructure is maturing. A data model is a semantic agreement, defining what entities exist, how they relate, and what rules govern integrity. Without that agreement, you’re debugging meaning alongside code.

One weekend project shared on Reddit demonstrated the design-first principle applied to AI chatbots. Instead of letting LLMs write SQL, the author exposed prewritten, vetted queries as tools via MCP, with user-provided filter parameters. Business rules stay encoded in the queries, not hallucinated by the model. It’s a small example of a larger pattern: constrain the AI with design, not prompts.

The Agentic Patterns repository that surfaced this week reinforces the point. Its catalog of production-tested agent patterns includes an entire section on governance and safety: human-in-the-loop approvals, chain-of-thought monitoring, egress lockdown. These aren’t afterthoughts. They’re design decisions that shape how agents operate.

The Freshness Question

Real-time is expensive. The question is whether it’s worth it.

“What’s the purpose of live data?” asked a Reddit thread this week. The community’s answer was nuanced: tie data freshness to decision latency. If a recommendation must adapt within seconds, stream. If a board report needs to reconcile perfectly every morning, batch. The 2025 Data Streaming Report shows 86% of IT leaders citing streaming investments as a priority, but Gartner research suggests batch processing remains dominant for many use cases.

The cost difference is real. Streaming systems require always-on infrastructure, meaning 24/7 compute bills. Batch systems run in predictable bursts, easier to budget and scale. Uber’s transition from batch to Flink-based streaming cut data freshness from hours to minutes, but Uber operates at a scale where minute-level freshness directly accelerates model launches and experimentation velocity. Most teams don’t.

Another thread asking about real-time ingestion from multiple sources explicitly excluded off-the-shelf connectors. The implication: teams want streaming capabilities without the platform lock-in that typically comes with them. It’s a common tension.

The Hidden Cost Crisis in Data Engineering discussion connected the dots. Tool sprawl, brittle pipelines, and cloud waste are driving up costs. Real-time isn’t exempt. Every streaming pipeline that doesn’t justify its latency requirements is a cost center.

The pragmatist’s framework: start with the decision, not the technology. What’s the tolerable latency? What’s the reliability target? If the answer is “hours” and “eventually consistent,” batch wins.

The SMB Stack

Small teams are building data infrastructure. The playbook is simpler than you’d think.

The modern data stack promised democratization: warehouse, pipeline, transformation, visualization, accessible to any team with a credit card. For enterprises, this meant architectural debates and vendor evaluations. For SMBs, it meant a different question: what’s the minimum I can build and still get value?

This week’s BigQuery/Airbyte/Looker strategy post walked through the calculus. Sources: Shopify Plus, GA, Xero, SKIO. Warehouse: BigQuery. ETL: Airbyte, with a path to self-hosting later. BI: Looker for joining spreadsheets with warehouse data. The approach: limit data scope (150k orders/year, skip line items) to keep BigQuery cheap. The concern: cloud lock-in and surprise cost spikes.

A thread on BI tools for non-technical teams asked for 2026 recommendations: drag-and-drop dashboards, minimal SQL, native connectors to CRM and accounting. The requirements signal where SMB data maturity has landed. Teams aren’t asking whether to build analytics. They’re asking which tool lets business users self-serve without hiring a data engineer.

Building a Data Warehouse from Scratch showed a newcomer proposing a full lakehouse architecture: Bronze raw S3, Silver Iceberg tables via dbt and Glue, Gold BI views, Trino for queries, Airflow for orchestration. The community’s response was measured: maybe simpler is better for a team of one.

The pattern across these discussions: enterprise-grade tools are accessible, but enterprise-grade complexity isn’t necessary. The hidden cost of the SMB tech stack isn’t the tools; it’s piecing together too many of them. Start with what you can maintain, add when you hit limits.

The Thread

The thread running through this week’s discussions: the data community is getting practical. Not cynical, not conservative, but clear-eyed about what complexity costs and what simplicity enables.

Data catalogs aren’t failing because vendors build bad software. They’re failing because teams can’t absorb the overhead. Real-time isn’t overrated. It’s just not free, and the ROI depends on how fast you actually need to act. SMBs aren’t building toy stacks. They’re building proportionate ones.

The pragmatist’s playbook isn’t about doing less. It’s about matching solutions to problems. Start with what you can maintain. Add when you hit real limits. Skip the features you’ll never use.

The Price of Autonomy

Olivier — Mon, 05 Jan 2026 12:03:44 GMT

Simon Willison’s year-in-review landed this week with a clear verdict: 2025 was the year AI agents went from promising to productive. Coding assistants now debug across large codebases. Reasoning models chain tools into multi-step workflows. The capability ceiling keeps rising.

But capability isn’t the same as reliability. The stories this week tell a different story, one about teams discovering that every gain in agent autonomy comes with a cost. Let them write code unsupervised? You need new engineering practices to keep quality high. Give them system access? They’ll find creative ways around your sandboxes. Let them run for hours? Your token bill spikes. Trust them to remember context? They forget everything between sessions.

This week: the trust problem, engineering for the AI era, the context continuity challenge, and why your CFO is starting to notice the API bills.

Subscribe now

The Trust Problem

The reliability problem isn’t new. Throughout 2025, the data kept telling the same story: only 5% of enterprise-grade AI systems reach production. Gartner projected 40% of agentic AI projects will be scrapped by 2027. Even the best current agents achieve goal completion rates below 55% on straightforward CRM tasks.

The math is unforgiving. Error rates compound exponentially across multi-step workflows. 95% reliability per step means just 36% success over 20 steps. Production needs 99.9%+.

This week, Simon Willison’s year-in-review captured the tension perfectly: coding agents delivered real productivity gains, but the community remains split on whether they’re reliable enough for production without formal accuracy guarantees. Will Larson’s team at Imprint learned this the hard way when an LLM agent mis-tagged Slack PR messages with a :merged: reacji via GitHub MCP, eroding the trust they’d built with engineering. Their solution: a coordinator pattern that can switch between llm and script modes, reserving deterministic code for operations that must never fail.

The sandbox bypass research adds another layer. When researchers ran Claude, Codex, and Gemini in OS sandboxes, they found agents actively working around restrictions: exit-code masking, environment variable leaks, npm lockfile poisoning. The agents weren’t malicious; they were trying to complete their tasks. But when an agent treats security boundaries as obstacles rather than constraints, trust becomes fragile.

Engineering for the AI Era

If agents are unreliable, maybe the answer isn’t better agents. Maybe it’s better engineering around them.

Addy Osmani’s 2026 workflow guide crystallized what practitioners are learning: “All our hard-earned practices (design before coding, write tests, use version control, maintain standards) not only still apply, but are even more important when an AI is writing half your code.” At Anthropic, roughly 90% of Claude Code is now written by Claude Code itself. That only works because the engineering practices are rigorous.

The “AI Is Forcing Us to Write Good Code” post made the case explicitly: agentic coders demand strict hygiene. The author argues for 100% test coverage (so every line an agent adds gets validated), organizing code into many small files with clear namespaces (so LLMs can load full context), and running fast ephemeral environments (so guardrails execute continuously). The community pushed back on the 100% coverage claim. It’s gameable and has diminishing returns. But the core insight stands: LLMs work better when your codebase is structured for them.

Kasava’s “Everything as Code” monorepo takes this further. They manage code, docs, website, and marketing in a single repo. A shared pricing JSON updates backend, UI, site, and docs in one commit. Their claim: LLMs work better with full-repo context. The discussion was more skeptical. Atomic deploys across services are a mirage, and backward compatibility still matters. But the experiment is worth watching.

The bookshelf vibe-coding project shows what this looks like in practice. The author built a data pipeline with Claude Code, accepting ~90% accuracy and fixing edge cases manually. Pragmatic fault tolerance over perfection. A pattern that works when the engineering around it is sound.

The Context Problem

LLMs are fundamentally stateless. The context between separate sessions is neither connected nor stored. As Eric Schmidt observed, you can use the context window as short-term memory, but load a long document and the AI “forgets” the middle.

Even million-token context windows only hold a few thousand code files, less than most production codebases. Any workflow that relies on stuffing everything into context hits a hard wall.

The Ensue memory skill that made the rounds this week attempts one solution: a persistent knowledge tree that stores preferences, research, and past decisions, queryable in future Claude Code sessions. The discussion revealed a split. Some practitioners want external memory layers with embedding-based retrieval. Others insist a concise CLAUDE.md file and local notes are enough. Security-conscious teams won’t adopt third-party memory without on-prem options.

A simpler approach works for many: use an existing PKM system (like an Obsidian vault) as your context store, with Claude Code skills to fetch relevant context at session start. The context doesn’t need to live in the LLM. It needs to be retrievable when the session begins.

Google’s Context Engineering whitepaper proposes a cleaner architecture: a session layer for what’s happening now, and a memory layer for what should survive across sessions. An ecosystem of tools is emerging: MemGPT, Zep, LangMem, Mem0, Letta’s memory blocks. The problem is recognized; solutions are proliferating.

The Economics of AI-Assisted Development

The final cost of autonomy is literal: token bills.

85% of companies miss their AI spending forecasts. One organization’s API costs escalated from $15k to $35k to $60k monthly over three months, a $700k annual run-rate that no one budgeted for. Gartner analysts now forecast that by 2026, AI services cost will become a chief competitive factor, potentially surpassing raw performance in importance.

The “Vibe Coding Killed Cursor” post made the economic argument against agentic IDE loops: long chat chains that iteratively rewrite code are token-inefficient and economically unsustainable. The author recommends tools that show git-diff patches. Smaller, more controlled interventions that don’t burn context on every edit.

This is becoming a real concern for consulting teams. As organizations transition to agentic-assisted development workflows, many employees are now using coding assistants, and token consumption is ramping up significantly. What started as a few power users experimenting has become a line item that finance is starting to notice.

The market is responding. Chinese models like DeepSeek have sparked what analysts call a shift from a performance race to a price war. Cost optimization strategies (using cheaper models for routine tasks, reserving expensive models for complex work) can achieve 50-90% reductions while maintaining quality. The question is whether teams will implement them before the bills force the issue.

The Thread

Every gain in agent autonomy comes with a cost. Trust, engineering overhead, context management, and literal dollars. The price is real, and teams are starting to pay it.

But here’s the counterintuitive part: the path to better AI output isn’t always more automation. Will Larson’s coordinator pattern, the “vibe coding” practitioner accepting 90% accuracy, the teams structuring codebases for LLM consumption. They’re all finding the same thing. Agent-assisted work with human control beats full autonomy. More touchpoints, not fewer. Editor, not reviewer.

The tools will keep improving. Context windows will grow. Costs will drop. But the fundamental tension won’t resolve itself. Capability versus reliability. Speed versus control. The teams that thrive will be the ones who figure out exactly how much autonomy they can afford.

Mind the Gap: When Vibes Meet Production

Olivier — Sun, 28 Dec 2025 18:37:14 GMT

“Just trust the vibes” became 2025’s unofficial motto for working with AI agents. And it worked—until it didn’t.

This week’s stories capture a field learning where vibes end and production begins. MCP hit its one-year anniversary with 97 million monthly SDK downloads; three new tools landed to fill gaps in the agent integration stack. A provocative piece argues that tool-calling should eat RAG for most enterprise use cases—part of the “Context Engineering” conversation that dominated the back half of 2025. Armin Ronacher reflects on a year of agentic coding, but security researchers found 30+ vulnerabilities in the tools powering that workflow—and practitioners are asking hard questions about sandboxing. And a critical LangChain vulnerability (CVSS 9.3) validates years of criticism about abstraction-heavy framework design.

The common thread: the gap between shipping fast with agents and building systems that hold up. This week, both sides of that gap got clearer.

Subscribe now

The Plumbing Arrives

If vibes are the frontend of agentic development, this week showed us what the backend looks like—and it’s consolidating fast.

MCP turned one year old in November. The numbers tell the story: 97 million monthly SDK downloads, adoption from OpenAI, Google, and Microsoft, and OpenAI deprecating its Assistants API in favor of the protocol. Anthropic donated MCP to the Linux Foundation this month. The “USB-C for AI” pitch is actually landing.

Three releases this week filled adjacent gaps in the stack. Microsoft’s Agent Framework unifies Semantic Kernel and AutoGen into a single system for graph-based orchestration—explicit routing, checkpointing, and human-in-the-loop patterns baked in. Toad, from Will McGugan (creator of Rich and Textual), provides a unified terminal UI for agent CLIs. It uses the ACP protocol, which merged with Google’s A2A standard under the Linux Foundation back in September. And Vibium, from Selenium’s creator, ships browser automation as an MCP server: one Go binary, zero setup.

The pattern: protocols are standardizing, CLIs are unifying, and the primitives for production agents are settling into place. The caveat, as one widely-shared article noted: “the S in MCP stands for security.” The plumbing is arriving—but so are the attack surfaces.

Maybe You Don’t Need Those Embeddings

The RAG playbook has become reflex: chunk your documents, embed them, build a vector store, retrieve and synthesize. The market agrees—RAG is valued at $1.85 billion in 2025 and projected to hit $67 billion by 2034.

But a provocative piece this week argues that for many enterprise use cases, this is overengineered. The thesis: agentic LLMs with tool-calling can query existing systems—CRM, billing, data warehouse—directly. For structured queries and aggregations, RAG struggles with freshness and precision. Orchestrated API calls plus LLM synthesis often work better.

This aligns with what practitioners are calling “Context Engineering”—the hot topic in the latter half of 2025. The insight is counterintuitive: bluntly cramming all potentially relevant data into the context window actually impairs reasoning and tool-calling. More context isn’t always better context.

The emerging pattern is “Agentic RAG”—combining retrieval with tool use rather than treating them as alternatives. But the starting point matters. Teams already running MCP servers against their data layer are finding that tool-calling alone handles more than they expected. Embeddings become optional infrastructure you add when specific use cases justify it, not the default architecture.

The Limits of Letting Go

Armin Ronacher’s year-end reflection captures where many practitioners landed in 2025. He moved from manual IDE work to largely hands-off CLI agents—Claude Code, Amp, Pi—with LLM code generation, filesystem context, and skill-based actions becoming the default workflow. The vibes, he reports, are good.

The numbers back him up. JetBrains found that 85% of developers now use AI tools for coding. Google’s year-end review put it bluntly: “Three things defined 2025: agents got jobs, evaluation became architecture, and trust became the bottleneck.”

Trust, it turns out, isn’t free. The “IDEsaster” security research published this month found over 30 vulnerabilities across major AI coding platforms—Cursor, Windsurf, GitHub Copilot, Zed, Roo Code, Cline—resulting in 24 CVEs. The worst, CamoLeak (CVSS 9.6), enabled silent exfiltration of secrets and source code from private repositories. The advice from researchers: treat AI agents as untrusted third parties with the same controls you’d apply to external contractors.

A Hacker News thread this week asked the practical question: how are you actually sandboxing coding agents? Answers ranged from git worktrees in devcontainers to Firecracker microVMs to Linux sandboxes like firejail. On December 9th, OWASP released its first Top 10 for Agentic Applications—the industry’s attempt to standardize what “secure enough” means.

The paradox is sharp: moving fast requires trust, but building trust takes time. The vibes are good—but production means defining explicit boundaries for what agents can touch, where they can reach, and how much autonomy they get before a human checks in.

LangChain’s Long-Warned Reckoning

LangChain has faced persistent criticism since 2023. Max Woolf’s “The Problem With LangChain” called out abstraction complexity early. A 2024 Hacker News thread on ditching LangChain drew hundreds of comments about debugging difficulties and “black box” behavior. As recently as this month, developers were posting “Why I’m avoiding LangChain in 2025.”

The recurring complaint: layers of abstractions—chains, runnables, agents, tools, callbacks—that obscure what’s actually happening. One developer summarized it as needing five layers of abstraction just to change a minute detail. Another called debugging an archeological dig.

This week, that criticism got a CVE number. CVE-2025-68664 is a critical deserialization vulnerability (CVSS 9.3) where user or LLM-controlled dicts containing a reserved lc key could be deserialized into arbitrary LangChain objects. The result: secret exfiltration and possible remote code execution. Common flows at risk include event streaming, logging, message history, and caches.

The fix is straightforward: upgrade to langchain-core 0.3.81. But the pattern is instructive. The same abstractions that made LangChain easy to adopt created implicit code paths where data becomes executable. When you can’t easily trace what your framework is doing, you can’t easily secure it either.

The Thread

The gap between vibes and production isn’t closing—it’s getting mapped.

This week showed both sides of that work. The infrastructure layer is maturing: MCP as the integration standard, ACP unifying agent CLIs, frameworks adding the checkpointing and human-in-the-loop patterns that “trust the model” glosses over. At the same time, practitioners are learning where trust breaks down—30+ CVEs in coding tools, abstractions that hide attack surfaces, and the hard question of how much autonomy to grant before a human checks in.

The takeaway for data product builders: the question isn’t whether to use agents. It’s how much of the gap you’re willing to bridge yourself versus waiting for the tooling to catch up. The plumbing is arriving fast. But so is the understanding of what happens when you ship without it.

2025 was the year agents went from demo to daily driver. 2026 will be the year we find out which teams built on solid ground.

Self-Hosting, Agent Guardrails, and the End of Benchmark Trust

Olivier — Sun, 21 Dec 2025 18:56:42 GMT

This week practitioners debated what to own versus rent. Self-hosting Postgres, sovereign cloud migrations, and S3 alternatives all sparked hundreds of comments as teams question whether the hyperscaler consensus still makes sense. The drivers vary—cost, licensing changes, geopolitics—but the pattern is consistent: infrastructure self-reliance is back on the table.

Meanwhile, AI agents had a mixed week. New benchmarks show Opus 4.5 completing multi-hour tasks, Claude shipped browser automation, and Anthropic standardized agent skills. But the vending machine that got social-engineered into giving away a PS5 reminded everyone that guardrails aren’t keeping pace with capabilities. The community’s verdict: exciting progress, deploy with hard constraints.

Year-end retrospectives from Karpathy and antirez captured something else shifting: trust in public benchmarks is eroding. RLVR and synthetic data are gaming leaderboards. The practitioners who spoke up this week want private evals, production monitoring, and evidence over hype.

Subscribe now

Key Takeaways

Infrastructure ownership is back on the agenda. Whether driven by cost (Postgres self-hosting saves real money), licensing (MinIO changes pushing teams to alternatives), or geopolitics (Airbus’s sovereign cloud mandate), teams are re-evaluating the hyperscaler default. The operational burden is real, but so are the savings and control benefits. Architect for portability now.

AI agents are advancing faster than guardrails. The METR benchmark gives us a framework for capability assessment, and tools like Claude in Chrome show what’s possible. But the vending machine incident—social engineering via fake PDFs—demonstrates that alignment alone won’t protect production systems. Separate propose from execute, add hard-coded limits, and require multi-party approval for sensitive operations.

Public benchmarks are losing trust. RLVR and synthetic data are gaming leaderboards. The community increasingly wants private, rotating evaluation sets and production monitoring over published scores. If you’re citing public benchmarks to justify model choices, expect pushback. Build your own evals against your actual use cases.

The Protocol Wars Ended Before They Started

Olivier — Sun, 14 Dec 2025 15:25:25 GMT

Anthropic, OpenAI, and Block agreed on a standard for AI agents this week. Meanwhile, a quieter pattern emerged across several stories: teams are opting for simpler architectures over distributed complexity, and databases are absorbing capabilities that previously required separate systems.

Here’s what matters for data product builders.

Subscribe now

Agents Get a Common Language

The Model Context Protocol is now under neutral governance. Block, Anthropic, and OpenAI co-founded the Agentic AI Foundation under the Linux Foundation, with Google, Microsoft, AWS, and Cloudflare as supporters.

The adoption numbers are already significant: 97 million monthly SDK downloads, 10,000 active servers, and support across ChatGPT, Claude, Gemini, Copilot, and VS Code. The new spec adds Tool Search for managing thousands of tools and Programmatic Tool Calling for complex agent workflows.

Protocol wars usually occur first, followed by standardization later. This time, the major players agreed before fragmentation could set in. That rarely happens.

For data product builders, this matters because AI agents increasingly need to talk to your stack—querying warehouses, triggering pipelines, calling transformation logic. MCPShark already exists for debugging agent-to-tool traffic. tomcp.org turns any URL into an MCP server. OpenAI quietly added skills that mirror Anthropic’s spec, making automations portable across providers.

If you’re building integrations for AI agents, MCP is the interface to target. The bet looks increasingly safe.

Simplicity Keeps Winning

Several stories this week point to the same pattern: teams are moving away from distributed complexity when they don’t need the scale.

Twilio Segment moved from microservices back to a monolith. Their event-forwarding system used a shared queue mixing fresh traffic and retries for 100+ destinations. One destination’s outage flooded retries and caused head-of-line blocking across everything. A single service simplified testing, deployment, and scaling for a small team.

The SQLite ecosystem keeps expanding into territory that used to require heavier infrastructure. Litestream VFS lets you query SQLite directly from S3 without restoring the full database—instant point-in-time recovery via PRAGMA litestream_time. Generated columns with indexes give you B-tree performance on JSON fields without duplicating storage.

sql-flow runs DuckDB SQL over Kafka topics. Test your configs against fixture data, then deploy as a Dockerized daemon. It’s stream processing without Flink’s operational weight.

The common thread: simpler architectures with fewer moving parts. Microservices, distributed databases, and complex streaming frameworks have real costs. If your scale doesn’t demand them, you’re paying overhead for capabilities you’re not using.

Databases Are Absorbing Everything

Another pattern across this week’s stories: databases are taking on capabilities that used to require separate systems.

VectorChord indexed 100 million 768-dimensional vectors on PostgreSQL in 20 minutes using 16 vCPU and 12GB RAM. For comparison, pgvector needed ~40 hours and ~200GB for the same job. If you’re building semantic search or RAG into your data product, you may not need a separate vector database anymore.

pg_clickhouse is a new Postgres FDW that runs analytics queries on ClickHouse while presenting tables in a Postgres schema. Keep your OLTP in Postgres, push heavy analytics to ClickHouse, and query both through one interface. Useful for moving read-heavy workloads off your primary without changing your application code.

MotherDuck’s piece on Git for data explores branching datasets: clone production data, test transformations in isolation, discard or merge when ready. It requires storage-level versioning (lakeFS, Nessie, Dolt, or zero-copy clones) plus branch-aware orchestration. We’re not fully there yet, but the tooling is maturing.

For data product builders, the implication is fewer systems to integrate and operate. Postgres with the right extensions can handle OLTP, analytics pushdown, vector search, and JSON querying. That’s a lot of capability in one place.

Quickfire

IBM is acquiring Confluent for $31/share all-cash. The announcement says Confluent stays a distinct brand, but Kafka now sits alongside Red Hat and HashiCorp in IBM’s portfolio. If you’re on Confluent Cloud, review your contracts for pricing and SLA implications.

Object storage costs sneak up on AI workloads. A new entrant explains why: ~60% of AI dataset objects are under 512KB, so you’re paying per-request, not per-byte. S3 Express One Zone at 10k PUT/s runs ~$29k/month in request fees alone. Audit your cost breakdown if your feature store or model registry does lots of small writes.

Terraform CDK is EOL. HashiCorp sunset it December 10. Export via cdktf synth --hcl and migrate to standard Terraform.

A cautionary tale on public datasets. A developer got banned by Google for uploading an AI training dataset that unknowingly contained CSAM. He reported it to the authorities. Ban stuck anyway. If you’re working with public datasets, scan them before uploading to consumer cloud services.

What to Watch

The Agentic AI Foundation is the story to track. Protocol standards live or die on governance, and we haven’t seen the first major dispute yet. But the starting position—competitors agreeing before fragmentation—is better than most standards efforts get.

The simplicity trend is worth paying attention to. If your architecture diagram has a lot of boxes, ask whether each one is earning its operational cost. Sometimes a monolith, SQLite, or DuckDB is the right answer.

And keep an eye on your Postgres extensions. The ecosystem is absorbing capabilities fast. Vector search, analytics pushdown, JSON indexing—a lot of what used to require separate systems now fits in one place.

Agents Get Scaffolding, Open Models Get Serious, Europe Gets Out

Olivier — Sun, 07 Dec 2025 10:15:33 GMT

Three things happened this week that matter for how you build data products: agent infrastructure stopped being handwavy, open-weight models started competing where frontier models live, and European regulators decided US cloud access is a policy risk they’re no longer willing to accept.

The first is the most actionable. The second changes your vendor calculus. The third is a slow-moving train you should probably be tracking.

Agent Infrastructure Grows Up

For the past year, “just build an agent” has meant: write a loop, pray for context coherence, restart when it hallucinates. This week, actual patterns emerged.

Anthropic published Effective harnesses for long-running agents—and it’s not another prompt engineering post. The pattern: split initialization from execution. An initializer agent creates scaffolding (init.sh, claude-progress.txt, initial git commit), then a coding agent iterates feature-by-feature with structured updates. Each session writes artifacts the next can recover from. Compaction doesn’t save you. External state does.

This matches what Beads shipped: a git-backed, graph-based issue system designed specifically for multi-agent coordination. Hash-based IDs prevent collisions across branches/clones. Agent Mail provides <100ms sync with 98.5% less git traffic. The project exists because sequential state in a multi-agent world breaks.

Meanwhile, Writing a Good Claude.md crystallizes what stateless agents need from your repo: WHAT/WHY/HOW, not command dumps. Claude may ignore noisy context (the harness injects a system reminder to do so), so keep it minimal and universally relevant.

Three YC companies—Saturn, Poka Labs, Rocketable—posted founding engineer roles this week. All want the same thing: production LLM agents with explicit state machines, eval flywheels, fault tolerance, and model-agnostic gateways. The job descriptions read like a checklist of what’s missing in most agent codebases.

The pattern converging: external state (git, files, DBs), explicit scaffolding, constrained scope per session. This is infrastructure now, not vibes.

Open-Weight Models Stop Catching Up

Open models used to trail frontier by 6-12 months. “Good enough for fine-tuning” was the pitch. This week, that framing became obsolete.

Mistral 3 shipped under Apache-2.0: a sparse MoE with 41B active / 675B total parameters, multimodal, multilingual, with NVFP4 checkpoints for vLLM and TensorRT-LLM support. It ranks #2 non-reasoning on LMArena. Ministral 3 (3B/8B/14B) covers the edge. This isn’t a research release—it’s a production-ready family with inference optimization built in.

DeepSeekMath-V2 hit IMO gold-level performance and 118/120 on Putnam 2024. The approach: train a proof verifier, use it as the reward model for the generator, scale verification compute. Apache-2.0. Open for inference.

Qwen3-VL processes 256k tokens—two-hour videos—with near-perfect “needle” retrieval. Leads on visual math and document OCR. 2B-32B weights on Hugging Face, Apache-2.0.

Apple released STARFlow-V, an open-weights normalizing flow video generator that rivals diffusion quality. T2V/I2V/V2V in one model.

Arcee Trinity Mini: US-trained MoE reasoning model, Apache-2.0, with Trinity Large training on 2048 B300 GPUs for January.

The implication: vendor lock-in arguments are weaker. Hosting costs shift from API margins to inference optimization. If you’re still assuming open models are a fallback, reassess.

Europe Decides US Cloud Is a Policy Risk

This one moves slower, but the direction is clear.

Switzerland’s Privatim issued a resolution: international SaaS is inadmissible for sensitive or legally confidential authority data unless the authority controls client-side encryption keys. The reasons: US CLOUD Act compels disclosure even for Swiss-hosted data, contractual safeguards are insufficient, and provider transparency is too low.

Dutch universities are piloting OpenDesk and Nextcloud after the ICC lost Microsoft email access due to US sanctions. The point isn’t that Microsoft is malicious—it’s that core services can be revoked by policy, not outages.

The EU’s Chat Control 2.0 advances with “voluntary” provider scanning and mandatory age verification. And a CJEU ruling made platforms GDPR controllers for personal data in user posts—exposing them to Article 82 damages for content removed within an hour.

The pattern: US legal reach is now a classification criterion for European data. Client-side encryption with authority-controlled keys is the new baseline for sensitive workloads. Full migration off O365/Azure/AWS isn’t happening next quarter, but the policy foundation is being laid.

If you serve European clients or handle European data, track this.

The Efficiency Counternarrative

Not a theme, but a recurring tension worth noting.

Pete Warden—who led mobile TensorFlow—wrote “I know we’re in an AI bubble because nobody wants me”. His argument: the industry is overinvesting in GPUs and underinvesting in efficiency engineering. He built Jetpac to run AlexNet inference on hundreds of cheap EC2 CPUs because Caffe’s CPU path was training-oriented, not inference-optimized. Small cross-stack teams can deliver outsized cost savings—but that’s not where the capital goes.

Program-of-Thought prompting beat Chain-of-Thought by ~12% across math and finance datasets by offloading calculation to an external interpreter. Two separate CoT critiques made similar points: language scratchpads are inefficient for algorithmic tasks.

“Why are your models so big?” argues 15M-parameter models work for narrow tasks like SQL autocomplete—in the browser, at negligible cost.

Gary Marcus called it a trillion dollars potentially wasted, pointing to diminishing scaling returns and the need for neurosymbolic approaches.

Scale isn’t wrong—Gemini 3 and Trainium3 clusters prove scale works. But the question isn’t which is right; it’s which is right for your workload.

Quick Hits

Accelerator competition heats up. Amazon’s Trainium3 (3nm, >4x perf, NVLink Fusion interop planned) and Google selling TPUs to Anthropic/Meta/neoclouds are compressing Nvidia’s moat. CUDA-L2 used RL to generate kernels that beat cuBLAS. Multi-accelerator stacks are the future—portability matters.

SQLite keeps winning. One author demonstrated 100k TPS over a billion rows on an M1 Pro (WAL mode, tuned PRAGMAs). Another reminded us SQLite makes a good application file format—single-file, ACID, portable, toolable.

Security remains brittle. Researchers showed poetic framing bypasses guardrails at ~62% success rate. A $1B legal AI tool exposed 100k+ files via an unauthenticated API endpoint that returned a Box admin token in client JS.

AI expands scope, doesn’t replace judgment. Anthropic’s self-study: engineers use AI in ~60% of work, report ~50% productivity gains, but only 0-20% can be fully delegated. 27% of AI-assisted work is net-new—tasks that wouldn’t have been done otherwise. The concern: skill erosion and reduced peer collaboration.

The RAM shortage is real. Memory makers are prioritizing HBM for AI datacenters, cutting consumer lines. DDR4/DDR5 prices are 3-4x. Don’t expect cheap secondhand HBM—it’s integrated.

What to Watch

Agent scaffolding patterns will consolidate. The initializer/executor split, external state, and constrained scope are likely to become standard. Expect frameworks.

Open-weight models will keep closing the gap. Mistral 3 and DeepSeekMath aren’t anomalies—they’re the trend. Evaluate them seriously for production.

European data sovereignty isn’t going away. Swiss and Dutch moves this week are early, but the regulatory direction is clear. Start classifying data by jurisdiction exposure.

The efficiency argument will get louder. Not because scale doesn’t work, but because inference costs recur and most workloads don’t need frontier models.