The Unsexy Work of Making Things Actually Work in Production
The Data Report - Week ending November 28, 2025
Introduction
Anthropic shipped Claude Opus 4.5 with advanced tool use, Ilya Sutskever declared the age of scaling over, an npm supply chain attack hit 492 packages including Postman and Zapier, Google announced its seventh-generation TPU with 9,216-chip superpods, and Swiss data protection officers effectively banned international cloud providers for sensitive government data. Same week.
One way to read all of this: the infrastructure layer is scrambling to catch up to what we’ve been promising. Model capabilities outran agent tooling. AI deployment outran security models. Training scale outran useful improvement. Now the bill is coming due.
This report identifies four patterns emerging from the convergence: agent infrastructure finally getting serious attention, the scaling era giving way to something else, security assumptions being actively dismantled, and compute infrastructure preparing for an inference-dominated future. The through-line is operational maturity—the unsexy work of making things actually work in production.
Theme 1: The Agent Infrastructure Gap
The Pattern: Everyone shipped agents in 2024. In 2025, everyone is shipping the infrastructure to make agents not break.
Evidence:
Claude Advanced Tool Use - Anthropic introduces Tool Search Tool (on-demand MCP discovery), Programmatic Tool Calling (loops/conditions via code), and Tool Use Examples. The result: “~85% token reduction with higher accuracy.” The fact that this needs to be engineered tells you how far raw model capability was from production reliability.
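The programmatic tool-calling idea can be sketched in a few lines. This is our own illustration of the pattern, not Anthropic's API: the tool name and schema here are hypothetical. The point is that the model emits a small program that loops over tool calls, so only the aggregate result re-enters the context window instead of one round-trip per call.

```python
# Sketch of programmatic tool calling (hypothetical tool, not Anthropic's API).
# Instead of N separate tool-call turns through the context window, the model
# emits one small program; only the final result returns to the context.

def get_invoice(invoice_id: str) -> dict:
    # Stand-in for a real tool; returns a fake record for illustration.
    return {"id": invoice_id, "amount": 100 + len(invoice_id)}

def model_emitted_program(invoice_ids):
    # This loop is what the model writes as code, rather than issuing
    # len(invoice_ids) individual tool calls.
    total = 0
    for invoice_id in invoice_ids:
        total += get_invoice(invoice_id)["amount"]
    return {"invoice_count": len(invoice_ids), "total": total}

result = model_emitted_program(["a1", "b22", "c333"])
print(result)
```

With three invoices, three tool results are collapsed into one summary object; the token savings scale with the number of calls the loop absorbs.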
Agent Design Is Still Hard - Armin Ronacher (creator of Flask) on lessons from building LLM agents: “High-level SDKs break with provider-side tools… prefer explicit cache management… isolate failures.” The detailed practical guidance suggests this isn’t solved by prompting harder.
Why Senior Engineers Struggle to Build AI Agents - “AI agents aren’t deterministic programs. Seniors often over-constrain them with strict schemas, hard-coded flows, and unit tests.” The recommendation: treat text as first-class state, let the LLM own control flow, replace unit tests with behavioral evals.
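"Replace unit tests with behavioral evals" can be made concrete with a small sketch. This harness is our own illustration under assumed names (the article does not prescribe an implementation): rather than asserting an exact output string, it asserts properties of the agent's behavior.

```python
# Sketch of a behavioral eval (illustrative harness, names are our own).
# A unit test would assert an exact string; a behavioral eval asserts
# properties that any acceptable response should satisfy.

def fake_agent(prompt: str) -> str:
    # Stand-in for a real LLM agent call.
    return "The refund was issued. Reference: RF-1042."

def eval_refund_behavior(response: str) -> dict:
    return {
        "mentions_refund": "refund" in response.lower(),
        "gives_reference": "RF-" in response,
        "no_apology_loop": response.lower().count("sorry") <= 1,
    }

response = fake_agent("Customer asks: where is my refund?")
results = eval_refund_behavior(response)
assert all(results.values())
print(results)
```

The checks tolerate nondeterministic phrasing while still failing on genuinely broken behavior, which is the trade the article argues for.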
MCP Apps Extension - OpenAI and Anthropic jointly proposing standardized interactive UIs in Model Context Protocol. The fact that competitors are collaborating on infrastructure suggests the problem is bigger than competitive differentiation.
Google Antigravity Exfiltrates Data - Researchers demonstrate indirect prompt injection against Gemini’s code editor: poisoned web content instructs the model to read .env files (bypassing .gitignore via shell ‘cat’), then exfiltrate credentials to webhook.site. Agent capabilities created attack surface that security models haven’t caught up to.
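One mitigation for this class of attack is to enforce a sensitive-file denylist in the agent's own tool layer, so that a prompt-injected `cat .env` fails even though .gitignore is irrelevant to the shell. A minimal defensive sketch, ours rather than anything Google ships:

```python
# Defensive sketch (our own, not from the research or Google's product):
# deny agent file tools access to sensitive paths at the tool boundary,
# since .gitignore does nothing against a shell `cat`.
from pathlib import Path

SENSITIVE_NAMES = {".env", ".env.local", "id_rsa", "credentials"}

def is_sensitive(path_str: str) -> bool:
    name = Path(path_str).name
    return name in SENSITIVE_NAMES or name.startswith(".env")

def guarded_read(path_str: str) -> str:
    if is_sensitive(path_str):
        raise PermissionError(f"agent denied read of sensitive file: {path_str}")
    return Path(path_str).read_text()

assert is_sensitive("project/.env")
assert not is_sensitive("project/main.py")
```

A denylist is a blunt instrument; the broader fix is treating every agent-readable file as reachable by attacker-controlled instructions.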
Why It Matters: The gap between “impressive demo” and “production deployment” for AI agents is infrastructure, not model capability. Tool orchestration, context management, failure handling, and security are the actual blockers. Anthropic dedicating engineering resources to Tool Use Examples—teaching models how to use similar-looking APIs correctly—is a tell. The abstractions we need don’t exist yet, and the ones we built are actively being broken.
Theme 2: The Scaling Reckoning
The Pattern: Three of the most influential voices in AI said variations of “scale is done” in the same week. The industry is listening.
Evidence:
Ilya Sutskever Interview - “We’re moving from the age of scaling to the age of research.” Pretraining has limits. Models show “jaggedness”—great on evaluations, brittle in deployment. Generalization improvements need new objectives beyond next-token prediction.
Sutskever and LeCun: Scaling Won’t Yield More Useful Results - LeCun advocates alternatives to LLMs (world models, JEPA). The consensus: benchmark performance doesn’t translate to real-world utility, and adding GPUs no longer fixes that.
A trillion dollars (potentially) wasted on gen-AI - Gary Marcus on diminishing returns from Kaplan scaling laws. Recommends shifting roadmaps from “bigger LLM” to hybrid neuro-symbolic designs and task-specific constraints.
Claude Opus 4.5 - Notably, Anthropic’s marketing emphasizes efficiency: “~15% better Terminal Bench vs Sonnet 4.5 with fewer tokens.” The competitive differentiator is doing more with less, not doing more with more.
Why It Matters: For data product practitioners, this shifts the planning horizon. The “wait for the next model” strategy is losing coherence. Post-training improvements (RLHF, tool use, process supervision), retrieval augmentation, task-specific fine-tuning, and hybrid approaches are where the returns are. Build for the models we have, not the models we were promised.
Theme 3: Security Through Exfiltration
The Pattern: The attack surface expanded faster than security models. This week documented the consequences.
Evidence:
SHA1-Hulud NPM Attack - 492 packages (~132M monthly downloads) compromised, impacting Postman, Zapier, PostHog. The payload: install Bun, run TruffleHog to find secrets, exfiltrate to random GitHub repos. Can infect up to 100 packages per host. Timed before npm’s Dec 9 classic token revocation.
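A first-response step for teams is scanning lockfiles against published indicators of compromise. A minimal sketch, with placeholder package names and versions standing in for the real SHA1-Hulud IOC lists that security vendors publish:

```python
# Sketch: scan a package-lock.json for known-compromised versions.
# The names/versions below are hypothetical placeholders, NOT the real
# SHA1-Hulud indicator list; substitute a vendor-published IOC feed.
import json

COMPROMISED = {("example-sdk", "9.9.9"), ("another-lib", "2.0.1")}  # placeholders

def find_compromised(lockfile_text: str):
    lock = json.loads(lockfile_text)
    hits = []
    for path, meta in lock.get("packages", {}).items():
        name = path.split("node_modules/")[-1]  # strip the lockfile path prefix
        if (name, meta.get("version")) in COMPROMISED:
            hits.append((name, meta["version"]))
    return hits

lock = json.dumps({"packages": {
    "node_modules/example-sdk": {"version": "9.9.9"},
    "node_modules/safe-lib": {"version": "1.0.0"},
}})
print(find_compromised(lock))
```

Scanning catches known-bad versions after the fact; the structural fixes are lockfile pinning, install-script restrictions, and rotating any tokens a compromised host could have read.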
Stop Putting Passwords into Random Websites - watchTowr scraped 80k+ publicly saved JSON snippets from JSONFormatter and CodeBeautify. Found: AD credentials, database credentials, cloud keys, JWTs, API tokens, PII, even an AWS Secrets Manager export. Root cause: developers paste real payloads and hit “save.”
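The cheap mitigation is redacting payloads before they ever reach a third-party formatter. A sketch of key-based redaction, with illustrative (not exhaustive) patterns:

```python
# Sketch: redact obvious secret-bearing keys from a JSON payload before
# pasting it into a third-party tool. Patterns are illustrative only;
# real secret scanners use far broader rule sets.
import json
import re

SECRET_KEY_RE = re.compile(r"(pass(word)?|secret|token|api[_-]?key)", re.IGNORECASE)

def redact(obj):
    if isinstance(obj, dict):
        return {k: ("***REDACTED***" if SECRET_KEY_RE.search(k) else redact(v))
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj

payload = {"user": "svc-app", "password": "hunter2",
           "config": {"api_key": "AKIA-example", "region": "eu-west-1"}}
redacted = redact(payload)
print(json.dumps(redacted))
```

Key-name matching misses secrets stored under innocuous keys, which is why value-pattern scanners like TruffleHog exist; but even this crude filter would have stopped most of the 80k leaked snippets.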
JDBC Driver Audit - $85k Bounty - LLM-assisted audit of JDBC drivers found Databricks user-controlled StagingAllowedLocalPaths enabling arbitrary local file read/write, chained via Git .git/config sshCommand to RCE. A separate Exasol driver bug allowed arbitrary file reads.
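The underlying bug class is a missing containment check on a user-influenced path. A sketch of the defense, our illustration of the pattern rather than the Databricks driver's actual code: resolve the requested path and verify it stays inside an allowlisted root.

```python
# Sketch of the missing check (our illustration of the bug class, not the
# Databricks driver's code): resolve a requested staging path and reject
# anything that escapes the allowlisted root via traversal or symlinks.
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/staging").resolve()  # hypothetical staging root

def safe_staging_path(requested: str) -> Path:
    candidate = (ALLOWED_ROOT / requested).resolve()
    if not candidate.is_relative_to(ALLOWED_ROOT):  # Python 3.9+
        raise ValueError(f"path escapes staging root: {requested}")
    return candidate

safe_staging_path("upload/file.csv")          # accepted
try:
    safe_staging_path("../../home/user/.git/config")  # traversal attempt
except ValueError:
    print("traversal blocked")
```

Resolving before comparing is the crucial step; comparing raw strings lets `..` segments and symlinks walk out of the root, which is exactly how a staging-path feature becomes arbitrary file read/write.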
ZoomInfo Pre-Consent Biometric Tracking - Researcher documented pre-consent mouse/typing capture via a decoded config flag, enableBiometrics: true, tied to Sardine.ai. 118 tracking domains. After the researcher posted evidence, the CEO blocked the comment.
US Banks Scramble After SitusAMC Breach - Data exfiltration from fintech vendor SitusAMC. JPMorgan, Citi, and Morgan Stanley were notified. Because SitusAMC processes billions of loan documents, the blast radius of exposed non-public banking data is significant.
Why It Matters: The perimeter doesn’t exist anymore. Your attack surface includes every SaaS tool where developers paste data, every npm package in your dependency tree, every JDBC driver connection string, every third-party vendor processing your data, and every AI agent with file access. Traditional security models assume boundaries. The evidence this week: there are no boundaries, only exfiltration opportunities.
Theme 4: Infrastructure for the Inference Era
The Pattern: Major infrastructure announcements this week share a common assumption: inference demand is about to dwarf everything else.
Evidence:
Google Ironwood TPU - Seventh-generation TPU with >4x performance per chip, scaling to 9,216-chip superpods via 9.6 Tb/s interconnect with 1.77 PB shared HBM. “Purpose-built for high-volume, low-latency inference.”
Building the Largest Known Kubernetes Cluster - 130k Nodes - GKE at 130k nodes with 1,000 Pods/sec and >1M objects in distributed storage. Enabled by control-plane changes: Consistent Reads from Cache (KEP-2340) and Snapshottable API server cache (KEP-4988).
TPUs vs GPUs Deep Dive - Technical analysis of TPU systolic-array design vs GPU general-purpose architecture. TPUs stream data through on-chip MACs, reducing memory traffic. Origin story: a 2013 projection that 3 minutes/day of Android voice search would double Google’s data center capacity.
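The systolic-array idea can be illustrated with a toy simulation. This is a conceptual sketch, not TPU internals: operands are loaded once and partial sums accumulate in place in a grid of MAC units, rather than each intermediate product round-tripping through memory.

```python
# Toy illustration of the systolic-array idea (a simulation, not TPU
# internals): one accumulator per output cell stands in for an on-chip
# MAC unit, and data "streams" through one wavefront per step, so partial
# sums never leave the grid.

def systolic_matmul(A, B):
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]   # the MAC grid
    for step in range(k):               # one wavefront of operands per step
        for i in range(n):
            for j in range(m):
                acc[i][j] += A[i][step] * B[step][j]  # multiply-accumulate in place
    return acc

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In hardware the payoff is that each operand is read from memory once and reused across a whole row or column of MACs, which is why the design cuts memory traffic for dense matmul-heavy inference workloads.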
Seagate 6.9TB Per Platter - HAMR + Mozaic 3+ enables 55-69TB 3.5-inch drives. Production 6.9TB platters targeted for 2030. HDDs remain best $/TB. “Datacenter backorders reportedly ~2 years due to AI demand.”
Why It Matters: The infrastructure layer is betting heavily that inference—serving models at scale—is the next bottleneck. Google’s moves (TPUs for inference, 130k-node clusters, DeepMind co-design) position for a world where training happens occasionally but inference happens constantly. The two-year datacenter backlog suggests this isn’t speculative; the capacity is already sold.
Meta-Observation: Operational Maturity as the Differentiator
Strip away the announcements and you’re left with a consistent pattern: the industry is pivoting from capability to reliability.
Agents need infrastructure, not just model improvements. Scaling hit diminishing returns; the gains are in post-training and efficiency. Security is being actively tested against the new attack surfaces. Infrastructure is preparing for inference, not training. Even governance is catching up—Swiss authorities effectively banned international cloud for sensitive data, the DOJ constrained algorithmic pricing models, and CERN published AI principles requiring human accountability.
The work that matters now is the unsexy work: tool orchestration that doesn’t break, security models that assume no perimeter, cost controls that scale, and deployment patterns that actually work. The demo phase is over. The operations phase is beginning.
For data product practitioners, the implication is concrete: the constraint has shifted. It’s no longer “can we build this?” It’s “can we operate this?” Build accordingly.
Looking Ahead
Questions to explore:
How does agent reliability get measured and standardized? Behavioral evals are a start, but where's the industry convergence?
If scaling is done, what does the investment landscape look like? Which post-training approaches actually compound?
Supply chain attacks on developer tooling (npm, JDBC, paste sites) suggest a pattern. What’s the next vector?
Inference infrastructure is scaling. Who captures the economics—cloud providers, chip vendors, or something new?
Methodology Note: This analysis covered all 116 stories published in the past 7 days. Stories were classified by depth: Tier 1 (58 high-signal stories: releases, deep-dives, research) anchored themes; Tier 2 (36 substantive discussions) supported patterns; Tier 3 (22 surface-level questions) were noted for meta-patterns only. Themes were identified by analyzing the complete dataset with depth-weighted prioritization.