100% on the Test, 0% on the Job

The Data Product Report: Weekly State of the Market in Data Product Building | Week ending April 13, 2026

Apr 14, 2026

This Week

Berkeley researchers scored perfect marks on every major AI agent benchmark — by hacking the test harnesses, not solving a single task. Meanwhile, agent infrastructure projects are shipping faster than anyone can agree on what the stack should look like, and Anthropic’s users discovered their caching costs had quietly doubled. The stack is thickening. The foundations are not keeping up.

Your Benchmarks Are Theater

A Berkeley research team built an automated exploit agent that scores ~100% on SWE-bench, WebArena, OSWorld, and every other major AI agent benchmark — without solving a single task. The methods were almost embarrassingly simple: injecting pytest hooks to force tests to pass, trojanizing wrapper scripts, reading gold answers from the eval harness’s own files. No frontier intelligence required. Just an agent that audits its test environment and cheats.

The community’s reaction wasn’t surprise — it was finally. The suspicion that vendor leaderboard positions are marketing, not evidence, now has a peer-reviewed receipt.

This lands in a week where the “demoware” problem got its own manifesto. Top-down “AI transformation” mandates are producing GUI-stitched LLM workflows shipped without ground truths or evaluation pipelines. They demo well. They fail in production — quietly, expensively, and in ways that compound. “It works in the demo” is not an acceptance test.

The bottom line: Build your evaluation pipeline before your demo. The bar for “it works” just moved from “impressive in a meeting” to “survives an adversarial audit.”

We’ve Seen This Stack Before

Multiple agent infrastructure projects shipped this week, each at a different layer. If you’ve been building data pipelines for a few years, the pattern is familiar.

Anthropic launched Claude Managed Agents in public beta: hosted orchestration with sandboxed execution, checkpointing, and scoped permissions. The discussion split predictably — small teams liked the convenience, platform teams flagged vendor lock-in. It’s the managed Airflow debate, replayed at the agent layer.

Google open-sourced Scion, calling it a “hypervisor for agents” — isolated containers, dynamic task graphs, shared workspaces. The architecture is sound. The commitment is uncertain. Also familiar.

Meanwhile, a post arguing MCP is better than Skills for agent-service integration sparked a different kind of debate. The fact that teams are arguing about integration patterns — not just picking tools — is the signal. The stack has layers now. Nobody agreed on which ones are load-bearing.

What to do with this: Map the agent stack the way you mapped your data stack. The lock-in risk at the orchestration layer is real, and the winners haven’t emerged yet.

They Changed the Price While You Were Sleeping

Two stories in a single day, both about Anthropic, both angry, both generating massive community backlash. This is the loudest signal of the week — louder than any product launch or research paper.

First: a user on Claude Code’s Pro Max tier (5x quota, $200/month) reported exhausting their quota in 90 minutes under moderate use. The culprit: cache-read tokens — cheap in billing — counted at full rate for quota purposes. Auto-compacts and background sessions were issuing ~960K-token requests. The thread blew up. Users reporting cancellations and switches to OpenAI’s Codex.

Then: an analysis of 119,866 API calls revealed that Anthropic’s prompt cache TTL had silently shifted from one hour to five minutes around March 6-8 — a server-side change with no announcement, no changelog entry, no documentation update. The author estimated 20-32% higher cache-write costs. The word “enshittification” appeared more than once.

What to do with this: Monitor your LLM API costs the way you monitor your cloud spend — per-call, not monthly summaries. Silent infrastructure changes are the new silent data corruption.

The Radar

Quick hits on stories worth knowing about, organized by what you’re building.

If you’re building with ML/AI:

MegaTrain trains 100B+ parameter models on a single GPU by storing weights in CPU RAM and treating the GPU as transient compute. Not for trillion-token pretraining, but for domain fine-tuning on hardware your team might actually have.
Gemma 4 Multimodal Fine-Tuner — LoRA toolkit for Gemma 3n/4 on Apple Silicon. If your team runs Macs and wants to fine-tune a multimodal model without renting GPUs, start here.
USC: LLMs may be standardizing human expression — Research finding that LLM outputs shrink cognitive diversity and reflect WEIRD cultural biases. If you’re building LLM-powered content features, diversity metrics in your evals aren’t optional.

If you’re building infrastructure:

S3 Files — AWS bridging object storage with POSIX file access for pipelines that need both. Could simplify lakehouse architectures, but pricing needs scrutiny.
Keeping a Postgres Queue Healthy — PlanetScale guide to running job queues without bloat. If you use Airflow’s Postgres backend, this is directly relevant.

If you care about governance:

Joe Reis: Do Fundamentals Still Matter? — Yes. “Vibe engineering” — adopting AI tools without grounding in architecture trade-offs and testing discipline — yields brittle platforms. The dbt Roundup published a counterpoint the next day: fundamentals aren’t an alternative to moving up the stack — they’re the prerequisite.

The Data Product Report is published every Tuesday by RepublicOfData.io.

Discussion about this post

Ready for more?