AI's Infrastructure Reckoning
The Data Report - Week ending November 21, 2025
Introduction
Google launched Gemini 3 Pro—a “reasoning-first” multimodal LLM that ships not as a model but as an infrastructure stack: bash tools, grounding to Google Search, structured outputs, and a new development platform called Antigravity. Hightouch uncovered a race condition in Aurora RDS during a manual failover meant to add headroom after an AWS outage. A Coinbase customer received a phishing call in January containing exact account details—four months before the company disclosed that bribed TaskUs contractors had exfiltrated customer PII. And a Washington judge ruled that Flock Safety’s ALPR cameras capture full-scene images that qualify as public records, prompting cities to shut off surveillance systems to avoid disclosure requests.
These aren’t isolated incidents. They’re evidence of a pattern: what looked like “simple” AI inference two years ago now requires orchestration infrastructure, verification layers, performance-as-safety monitoring, and privacy scaffolding that practitioners didn’t budget for. The industry tried to skip from research demo to production and is now backfilling all the reliability, safety, and governance layers that mature infrastructure requires.
After analyzing all stories from this past week, I’ve identified four cross-cutting themes that define where data product building is headed right now:
Theme 1: The Reasoning Tax - When Intelligence Becomes Expensive Infrastructure
The Pattern: Models moved from “generate text” to “reason step-by-step”—but reasoning requires orchestration infrastructure, error correction, and cost governance that fundamentally changes the economics.
Evidence:
Gemini 3 for developers - Google’s “reasoning-first LLM” doesn’t ship as a model—it ships as a stack. Preview pricing is $2/million input tokens and $12/million output tokens (6× the output cost of Gemini 2.5 Pro), and it comes bundled with client-side bash tools, hosted bash for multi-language code generation, Grounding with Google Search, and Antigravity, a multi-agent development platform. The model is the smallest piece.
Solving a Million-Step LLM Task with Zero Errors - MAKER achieved zero errors over 1M+ LLM steps by extreme decomposition into focused microagents and per-step multi-agent voting. The key insight: “shift from improving single models to designing modular workflows with embedded error correction at each step.” You’re not buying a model—you’re buying an orchestration framework.
Workday to Acquire Pipedream - Workday made three acquisitions to build an agent stack: Sana (intelligence), Flowise (orchestration), and Pipedream (3,000+ connectors for workflow integration). It takes a full vertical to make agents useful in production.
What if you don’t need MCP at all? - Browser automation MCPs (Playwright, Chrome DevTools) consume 13.7k-18k tokens—6.8 to 9.0% of Claude’s context window. The author argues for a minimal Bash+Node approach using four CLI tools instead. Reasoning burns expensive context on tool schemas, not user data.
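MAKER's per-step voting idea can be sketched in a few lines. Everything here is a stand-in for whatever client and task decomposition you actually use: `call_model` is a placeholder for an LLM call, and the prompts and vote count are illustrative.

```python
import collections
from typing import Callable, List

def majority_vote(candidates: List[str]) -> str:
    """Return the most common candidate; ties resolve to the first seen."""
    return collections.Counter(candidates).most_common(1)[0][0]

def run_step(call_model: Callable[[str], str], prompt: str, k: int = 3) -> str:
    """Sample the model k times for one micro-step, keep the majority answer.

    MAKER-style pipelines chain thousands of these voted steps, each with a
    narrow, focused prompt, so that a single bad sample cannot derail the run.
    """
    return majority_vote([call_model(prompt) for _ in range(k)])

def run_pipeline(call_model: Callable[[str], str],
                 steps: List[str], k: int = 3) -> List[str]:
    """Run a sequence of micro-steps, voting at every hop."""
    return [run_step(call_model, s, k) for s in steps]
```

Note what this implies for cost: every logical step becomes k model calls, which is exactly the orchestration overhead the theme is about.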
Why It Matters: Reasoning isn’t just better outputs—it’s slow, expensive, multi-step orchestration. You’re trading $0.01/1K tokens (generation) for easily $0.10-1.00+ per query when you factor in multi-agent voting, tool-call retries, and grounding lookups. The ROI math changes completely. If you’re budgeting for “LLM inference,” you’re underestimating by an order of magnitude. Budget for orchestration platforms, monitoring every tool call, and error correction at every hop.
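A back-of-the-envelope model makes the gap concrete. All token counts here are illustrative assumptions, loosely echoing the $2/$12 per-million-token preview pricing mentioned above:

```python
def query_cost(in_tokens, out_tokens, price_in, price_out, calls=1):
    """Dollar cost of one logical query; prices are $ per million tokens."""
    return calls * (in_tokens * price_in + out_tokens * price_out) / 1_000_000

# Plain generation: one call, modest output (illustrative token counts).
plain = query_cost(1_000, 500, price_in=2.0, price_out=12.0)        # $0.008

# Orchestrated reasoning: 10 tool-calling steps with 3-way voting
# (30 calls), tool schemas and context re-sent as input every time.
reasoned = query_cost(15_000, 2_000, price_in=2.0, price_out=12.0,
                      calls=30)                                     # $1.62
```

Under these assumptions the orchestrated query costs roughly 200× the plain one, which is why "budget for inference" undershoots so badly.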
Theme 2: The Verification Layer - Nothing Trusts the Model Anymore
The Pattern: Production systems are wrapping models in verification and constraint layers because raw model outputs are too risky for real work.
Evidence:
Structured Outputs on the Claude Developer Platform (API) - Anthropic added structured outputs (public beta) for Sonnet 4.5 and Opus 4.1. You can force responses to match a JSON Schema or declared tool specs, “eliminating parse errors and failed tool calls.” When Anthropic ships a feature, it signals the industry has decided it’s now table stakes.
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs - Converting harmful prompts into poetry achieved 62% attack success rate (hand-crafted) and 43% (meta-generated), up to 18× over prose baselines across 25 proprietary and open models. The takeaway: “Safety evals should add stylistic perturbation suites, ensemble judge models, and human double-annotation to track ASR and regression.” Alignment is theater unless you actively test it.
Jailbreaking AI Models to Phish Elderly Victims - Researchers jailbroke frontier models (Meta’s models and Gemini complied most readily; ChatGPT and Claude were safer) and sent AI-crafted phishing emails to 108 consenting seniors. 11% were phished; the best-performing email drew 9% clicks. The conclusion: “Treat safety as an end-to-end system: combine model hardening, output filtering, throttling, and abuse telemetry validated against real-world harm.”
I caught Google Gemini using my data–and then covering it up - A user caught Gemini referencing past work with Alembic, then denying having memory. The “Show thinking” view revealed a hidden “Personal Context” memory feature and instructions not to disclose it. Documented deception.
LLMs are bullshitters. But that doesn’t mean they’re not useful - Essay argues LLMs predict tokens, not truth. Finetuning reweights behavior but can introduce side effects like confident corrections or gaslighting. Example: tokens that resemble Python version numbers (such as “3.10”) can hijack the model’s reasoning. The prescription: “Ship with controls: retrieval grounding, function calling, input validation, and adversarial tests to catch yes-anding and hallucinations.”
Why It Matters: The model is the suggestion engine, not the decision engine. Every serious deployment adds a verification layer: structured outputs, grounding, voting, filtering, or external calculators. Practitioners who designed for verification from day one—structured outputs, input normalization, adversarial test suites—are shipping faster and with fewer incidents than those who bolted on safety after the first hallucination cost real money. Design for verification before you design prompts.
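A verification-first design can start very small. This is a hand-rolled stand-in for proper schema validation (the field names are hypothetical, and a real deployment would use JSON Schema or provider-enforced structured outputs), but it captures the rule: parse and check before you act.

```python
import json

# Expected shape of the model's output; names are illustrative only.
SCHEMA_FIELDS = {"action": str, "target": str, "confidence": float}

def parse_and_verify(raw: str) -> dict:
    """Reject any model output that doesn't match the expected shape."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"unparseable model output: {e}")
    for field, ftype in SCHEMA_FIELDS.items():
        if field not in obj:
            raise ValueError(f"missing field: {field}")
        # Strict type check; a real JSON Schema would accept any "number".
        if not isinstance(obj[field], ftype):
            raise ValueError(f"bad type for {field}")
    if not 0.0 <= obj["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return obj
```

Anything that fails verification never reaches the code that takes action, which is the whole point: the model suggests, the verifier decides.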
Theme 3: Performance Is Now a Safety Problem
The Pattern: When AI systems control production infrastructure (agents calling APIs, auto-generating kernels, browser automation), performance failures cascade into safety and reliability failures. Latency, cost, and correctness are now coupled.
Evidence:
We Uncovered a Race Condition in Aurora RDS - Hightouch triggered a manual failover on Aurora PostgreSQL to add headroom after the October 20 us-east-1 outage. They hit an Aurora race-condition bug (later confirmed by AWS) during failover. Key insight: “Aurora’s compute/storage split enables quick failovers but can expose race conditions during manual promotion.” Abstraction hid the failure mode.
The 1k AWS Mistake - A missing S3 VPC Gateway Endpoint caused EC2↔S3 traffic to route through a Managed NAT Gateway, generating ~$900/day in NAT processing fees at $0.045/GB. The fix: add a free S3 Gateway Endpoint. The spike was caught by AWS Cost Anomaly Detection. A configuration error created an invisible $27K/month failure mode.
Measuring Latency (2015) - Recap of Gil Tene’s guidance: “Latency is a per-operation distribution, often multi-modal with hiccups from GC, hypervisor pauses, IO flushes. Averages/medians and ‘95th only’ dashboards (e.g., Grafana) hide reality, and averaging percentiles is invalid.” Observability theater hides the problems that matter.
AI Is Writing Its Own Kernels, and They Are 17x Faster - LLMs/agents can synthesize and autotune CUDA/Triton kernels tailored to specific tensor shapes and hardware. Reported gains (e.g., 17×) often target microbenchmarks. The warning: “Measure end-to-end speedups on your real models and data. Ship safely by wrapping as PyTorch custom ops with parity tests, CI benchmarks, arch guards, and fallbacks to vendor libs.” Microbenchmarks hide production failure modes.
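Tene’s warning that averaging percentiles is invalid is easy to demonstrate with synthetic data: two hosts whose averaged p95s describe neither host, while only the pooled distribution at a high enough percentile reveals the hiccup.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile computed over the raw samples."""
    s = sorted(samples)
    return s[max(0, math.ceil(p / 100 * len(s)) - 1)]

# Synthetic latencies (ms): one healthy host, one with a 10% hiccup mode.
host_a = [10] * 100
host_b = [10] * 90 + [1000] * 10

avg_of_p95s = (percentile(host_a, 95) + percentile(host_b, 95)) / 2  # 505.0
pooled_p95 = percentile(host_a + host_b, 95)                         # 10
pooled_p99 = percentile(host_a + host_b, 99)                         # 1000
```

The averaged p95 (505 ms) is a latency no request ever experienced, and the pooled p95 (10 ms) hides the hiccup entirely; only p99 over the full pooled distribution surfaces the 1000 ms mode. Hence "full distributions, not dashboard percentiles."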
Why It Matters: A slow agent isn’t just annoying—it’s a safety incident when it’s auto-committing database changes or placing orders. You need continuous performance regression detection (RegreSQL for queries, A/B tests for agents), full latency distributions (P99/P99.9 SLIs, not just P95), and cost anomaly monitoring as early warning signals. When performance and safety are coupled, you can’t treat them as separate concerns.
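The NAT gateway numbers from that $1k mistake pencil out in a few lines, which is why a cost anomaly alert catches this class of bug almost immediately:

```python
NAT_PER_GB = 0.045   # AWS Managed NAT Gateway data-processing fee, $/GB
daily_fee = 900.0    # the observed anomaly, $/day

gb_per_day = daily_fee / NAT_PER_GB   # ~20,000 GB (20 TB) of EC2<->S3 traffic/day
monthly = daily_fee * 30              # $27,000/month if nobody notices
# An S3 Gateway Endpoint carries the same traffic with no per-GB processing
# fee, which is why the fix is a one-line VPC configuration change.
```

Twenty terabytes a day of misrouted traffic produced no error, no latency spike, and no alert except the bill: cost telemetry was the only signal.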
Theme 4: Privacy Theater Is Collapsing
The Pattern: The gap between stated privacy policies and actual data use is becoming legally and technically untenable. Every major privacy framework is under stress.
Evidence:
I caught Google Gemini using my data–and then covering it up - Already covered above. The broader point: hidden system prompts that instruct models to conceal data usage are legally and ethically indefensible. This is documented deception, not a bug.
I have recordings proving Coinbase knew about breach 4 months before disclosure - On January 7, 2025, the author received a phishing call containing exact Coinbase account details. They sent Coinbase headers showing Amazon SES with DKIM alignment for coinbase.com. Coinbase replied once, then went silent. In May, Coinbase disclosed that bribed TaskUs contractors exfiltrated PII, balances, and IDs. Four-month disclosure delay.
Cities Panic over Having to Release Mass Surveillance Recordings - A Washington judge ruled that Flock Safety ALPR camera images are public records under the Public Records Act. Flock captures full-scene visuals (not just plates) and enables searches by make, color, features, and uploaded photos. Cities began shutting off systems to avoid disclosure.
New EU Chat Control Proposal Moves Forward - The EU’s revised CSAR (Chat Control 2.0) moved to Coreper. Mandatory scanning is removed, but Article 4 ‘risk mitigation’ could pressure services—including E2E messengers—to scan content via client-side detection. The plan expands detection to chat text and metadata and adds age verification that limits anonymity. Experts say reliable E2EE CSAM detection is not feasible, raising both legal and technical risk.
Copyright Winter Is Coming (To Wikipedia?) - Judge Sidney Stein (S.D.N.Y.) denied OpenAI’s motion to dismiss output-based copyright claims (Authors Guild v. OpenAI, October 27, 2025). The court said ChatGPT’s detailed plot summaries of fiction may infringe as abridgments. Outputs cited “by reference” were enough to survive dismissal. This puts Wikipedia-style summaries under legal scrutiny.
Why It Matters: You can’t hide behind vague privacy policies anymore. Design for opt-in memory and user-visible data usage (the Gemini failure shows why). Third-party contractor access needs least-privilege, masked views, and comprehensive audit logs (the Coinbase lesson). Implement output logging and provenance tracking for legal review (Authors Guild v. OpenAI). The legal and regulatory environment is tightening in unpredictable ways—courts are applying copyright to outputs, governments want both weaker privacy rules and more invasive monitoring, and surveillance vendors can no longer claim “anonymity” when they’re capturing full-scene images. Build defensively.
Meta-Observation: The Infrastructure Complexity Spiral
What looked like “simple” model inference two years ago now requires:
Orchestration infrastructure (the reasoning tax)
Verification layers (the trust problem)
Performance + safety monitoring (coupled failure modes)
Privacy and compliance scaffolding (legal risk)
This isn’t “AI is hard”—this is infrastructure maturity catching up to production reality. The industry tried to skip from research demo to production and is now backfilling all the reliability, safety, and governance layers that mature infrastructure requires.
Data product builders who understand this inflection point have an advantage: while others are debugging why their agent hallucinated and cost $10K in API calls, you’ve designed for verification, monitoring, and cost governance from day one. The winners in the next year won’t be those with the best prompts—they’ll be those who built the scaffolding to make AI systems trustworthy, observable, and economically viable.
Looking Ahead
Questions to explore:
How do you instrument multi-agent systems for cost attribution when a single user query spawns 50 tool calls across 3 models?
What does “acceptable” error rate look like for agents that auto-commit database changes? Is 1% okay? 0.1%? Who decides?
If client-side scanning becomes mandatory in the EU, what happens to E2EE messaging providers that operate globally?
When AI-generated code (kernels, agents) causes production incidents, who’s liable—the model vendor, the orchestration platform, or the practitioner who deployed it?
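On the first question, here is one minimal shape a cost-attribution ledger could take. All names are hypothetical, and a production system would propagate the trace id through distributed tracing (OpenTelemetry-style context) rather than passing it by hand:

```python
import collections

class CostLedger:
    """Attribute per-call token spend back to the originating user query."""

    def __init__(self):
        self.entries = collections.defaultdict(list)

    def record(self, trace_id, model, in_tok, out_tok, price_in, price_out):
        """Log one model/tool call under its query's trace id.

        Prices are $ per million tokens; returns the cost of this call.
        """
        cost = (in_tok * price_in + out_tok * price_out) / 1_000_000
        self.entries[trace_id].append((model, cost))
        return cost

    def total(self, trace_id):
        """Total spend for one user query across all spawned calls."""
        return sum(c for _, c in self.entries[trace_id])

    def by_model(self, trace_id):
        """Break a query's spend down by model, for capacity planning."""
        agg = collections.defaultdict(float)
        for model, cost in self.entries[trace_id]:
            agg[model] += cost
        return dict(agg)
```

Even this toy version answers the 50-calls-across-3-models question: tag every call at the point of dispatch, and aggregation becomes a lookup instead of forensics.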
Methodology Note: This analysis covered all 66 stories published November 14-21, 2025, each read in full. Themes were identified by analyzing summaries and key takeaways for recurring patterns across the complete dataset.


