Your AI Analytics Problem Isn’t the Model

The Data Report: Weekly market signals on modern data platform shifts | Week ending June 22, 2026

Jun 23, 2026

The 30-second version

The same Claude went from 21% to 95% accuracy with no model change. The difference was a governed semantic layer and encoded skills. Your AI accuracy problem is a meaning problem.
Nasdaq’s agent-governance rule: automate the deterministic work, gate the judgment calls. The win isn’t fewer errors, it’s catching the silent ones.
In the Radar: DuckDB’s pitch to be the engine agents actually need, a benchmark that catches training-data cheating, and a shift in how agents get access.

This Week

When an AI tool returns a wrong number, the reflex is to blame the model and wait for the next one. Anthropic spent this week quietly dismantling that reflex: the same Claude went from 21% to 95% accuracy with no model change.[1] The whole gap was the governed semantic layer, the encoded skills, and the validation they wrapped around it.

That sharpens the question every platform owner is already sitting with: what has to be true before you let an agent near the warehouse? Two answers landed this week, one on accuracy and one on governance, plus a provocation in the Radar about whether the warehouse is even the right engine for agents at all.

The Same Model Went From 21% to 95%

Up front: swapping models won’t fix a wrong number. Writing down what your data means will.

The first time an agent hands back a wrong figure, the instinct is to blame the model. This week Anthropic put a price on that instinct. In its own writeup, the same Claude scored about 21% on their internal analytics questions when it queried the warehouse raw. After they made a governed semantic layer mandatory, encoded their repeatable analyses as skills, and added evaluation and monitoring, it reached about 95%, and closer to 99% in some domains.

Hold the model constant and the lift has nowhere to hide. It isn’t a smarter Claude. It’s the knowledge a schema can’t carry: what “active customer” means here, which join is the right one, how this shop counts revenue in ways the training data never did.

The gap was never the reasoning. It was the business meaning nobody had written down in a form a machine could read.

dbt gave that a name this week: “semantic debt.” Every org carries definitions that were never made machine-readable, and humans pay that debt down silently, in meetings, where two analysts notice their churn numbers disagree and quietly settle which one is right. An agent doesn’t get the meeting. It picks one interpretation and scales the inconsistency across every answer it gives.
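To make “machine-readable” concrete, here is a minimal sketch of the idea, not Anthropic’s or dbt’s implementation. It assumes a hypothetical in-code registry (`SEMANTIC_LAYER`) holding one agreed definition per metric, with an invented table, filter, and owner. The point is the refusal path: when no governed definition exists, the agent stops instead of picking an interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One governed definition: the meaning a raw schema cannot carry."""
    name: str
    expression: str   # the single agreed aggregation
    source: str       # the table (or join) the metric must be read from
    filter: str       # the business rule that makes the number "ours"
    owner: str        # who settles disputes about the definition


class UndefinedMetricError(LookupError):
    """Raised when an agent asks for a metric nobody has written down."""


# Hypothetical registry; table, column, and owner names are illustrative only.
SEMANTIC_LAYER = {
    "active_customers": Metric(
        name="active_customers",
        expression="COUNT(DISTINCT customer_id)",
        source="orders",
        filter="status = 'paid'",
        owner="analytics",
    ),
}


def resolve_metric(name: str) -> Metric:
    """Return the governed definition, or refuse. Never guess."""
    metric = SEMANTIC_LAYER.get(name)
    if metric is None:
        raise UndefinedMetricError(
            f"No governed definition for {name!r}; a human has to define it first."
        )
    return metric


def build_query(metric_name: str) -> str:
    """Build SQL only from the registry, so every answer uses the same meaning."""
    m = resolve_metric(metric_name)
    return f"SELECT {m.expression} AS {m.name} FROM {m.source} WHERE {m.filter}"
```

Calling `build_query("active_customers")` always yields the same query, while `build_query("churn_rate")` raises until someone settles the definition. That is the meeting the agent otherwise never gets.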

Bottom line: The teams treating the semantic layer as the prerequisite are shipping. The ones still waiting for a smarter model are stuck near 21%.

Determinism Is the Gate

Up front: stop asking whether to trust the agent. Decide which work is deterministic, and gate the rest.

Say you get the accuracy. Now you have an agent that mostly works, and “mostly” is the scary word if you run data anywhere near a regulator. The failure you can see is fine. The silent one should keep you up: an agent that starts returning wrong answers and tells no one.

The clearest answer this week came from Nasdaq’s data leadership, and it reads like a working playbook, not a governance sermon. The rule at its center is almost rude in its simplicity. Deterministic workflow? Automate it. Judgment call? Keep a human in the loop with a confidence threshold. Entity resolution runs with thresholds and human checks, not blind trust. Agents write code, draft docs, and generate tests. Humans gate the production deploy.
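The rule is simple enough to sketch in a few lines. This is an illustration under stated assumptions, not Nasdaq’s code: the `Task` fields, the 0.9 threshold, and the reading that a judgment call above the threshold may proceed without review are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class Task:
    name: str
    deterministic: bool              # same input always gives the same, checkable output
    confidence: float = 0.0          # agent-reported score in [0, 1]; unused if deterministic
    production_deploy: bool = False  # anything that ships to production


# Assumed value; a real threshold would be tuned per use case.
CONFIDENCE_THRESHOLD = 0.9


def route(task: Task, threshold: float = CONFIDENCE_THRESHOLD) -> Route:
    """Automate the deterministic work, gate the judgment calls."""
    if task.production_deploy:
        # Humans gate the production deploy, whatever the agent produced.
        return Route.HUMAN_REVIEW
    if task.deterministic:
        return Route.AUTOMATE
    if task.confidence >= threshold:
        # Judgment call, but confident enough under the assumed policy.
        return Route.AUTOMATE
    return Route.HUMAN_REVIEW
```

Under these assumptions an entity match at 0.97 passes, the same match at 0.60 goes to a person, and a deploy always does. The design choice worth copying is that the gate is ordinary code with an explicit order of checks, so it can be reviewed and tested like anything else.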

What lifts it above the usual governance talk is that it treats governance as plumbing, not paperwork.[2] “AI fails silently” is the line their data leader keeps coming back to, and it’s the whole motivation: you build the framework not because the agent will obviously break, but because when it breaks quietly, the framework is the only thing that notices before your customers do.

Bottom line: The rule doesn’t buy you fewer mistakes. It buys you a tripwire, and in a regulated shop, catching the quiet failure is the entire job.
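One possible shape for that tripwire, sketched under assumptions: a small “golden” set of questions with known numeric answers is replayed against the agent on a schedule, and an alert fires when accuracy drops below a floor. The function names, the 0.95 floor, and the tolerance are hypothetical; the source does not describe how Nasdaq implements its checks.

```python
import math
from typing import Callable, Mapping


def accuracy(
    answer: Callable[[str], float],
    golden: Mapping[str, float],
    rel_tol: float = 1e-6,
) -> float:
    """Share of golden questions the agent answers within tolerance."""
    if not golden:
        raise ValueError("golden set must not be empty")
    correct = sum(
        1
        for question, expected in golden.items()
        if math.isclose(answer(question), expected, rel_tol=rel_tol)
    )
    return correct / len(golden)


def tripwire_fired(
    answer: Callable[[str], float],
    golden: Mapping[str, float],
    floor: float = 0.95,  # assumed floor; pick one you have actually measured
) -> bool:
    """True when accuracy has dropped below the floor and a human should look."""
    return accuracy(answer, golden) < floor
```

The check says nothing about why an answer went wrong. It only guarantees that a quiet regression becomes a loud one.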

The Radar

🦆 Rethinking the engine under your agents. Jordan Tigani of MotherDuck argues the distributed warehouse was never built for how agents query: dozens of small, throwaway, parallel scans, billed like each one mattered. His fix is an in-process engine like DuckDB, with a one-line promote to managed cloud when a job needs scale. Read it as a pitch, since he sells it.[3] But the question costs nothing to ask: are you paying warehouse rates because you measured it, or because that’s where the data already lives?

🎯 Putting an agent on your warehouse. In a controlled dltHub benchmark, the agent scored 3 out of 10 on raw tables and 10 out of 10 with an explicit ontology. The catch: the model aced famous public datasets with no ontology at all, because it had memorized them in training. Benchmark on your own data or you’ll overestimate. NAB, an Australian bank, reports the production version: 2 to 4 developer-days saved per use case, once the trusted datasets existed first.

🔒 Governance. MosaicLeaks is the first real benchmark of a leak nobody measures: research agents leaking private documents through the search queries they send out. Training the agent for privacy cut leakage from about 34% to under 10% without hurting the work. And the Model Context Protocol team shipped Enterprise-Managed Authorization, centralizing agent access through your identity provider. Cleaner control, but one less bit of friction on a misbehaving agent.

💸 Chasing warehouse spend. Monte Carlo shipped two agents worth reading together: a Cost Agent that ranks waste by impact and risk, and Agent Lineage that ties a wrong agent answer back to whether the agent reasoned badly or the data underneath shifted. That second one is exactly what the Nasdaq playbook needs to operate.

🧩 Weighing semantic-layer vendors. A trade-press analysis puts McKinsey numbers on the lead story: fewer than 10% of agent pilots scale, and around 80% of the failures cite data and semantics limits. Most vendor layers are still built for dashboards, not for how agents query, so keep your definitions portable before you’re locked in.

Reply and tell me: do you actually know your accuracy floor without a semantic layer, and have you ever measured what those throwaway agent queries cost?

Published by RepublicOfData.io. Curated by Olivier Dupuis.

[1] These are Anthropic’s own figures, on their own data, graded by their own evaluation, not audited. The residual 5% is exactly where the governance burden concentrates, the slice you still can’t let run unwatched. What holds up is the shape, not the decimals. The dltHub benchmark in the Radar makes the same point from the other direction, with the methodology exposed and the data designed against training leakage.

[2] Concretely: an embedded review committee rules on which use cases qualify, how the data is classified, which model is allowed, what validation methodology counts as enough, and what explainability artifacts you have to be able to hand a regulator later.

[3] MotherDuck tested its own argument by replacing its own business-intelligence tool in under a month, handing an agent the dashboard migration and the reconciliation of new numbers against old. Vendor eating its own cooking, so weigh it as such. But the dashboards-as-code shape (definitions in version control, deployed through continuous integration, migrated numbers checked by an agent) is a credible one for anyone facing a BI renewal.
