Agents Get Scaffolding, Open Models Get Serious, Europe Gets Out
The Data Report - Week ending December 5, 2025
Three things happened this week that matter for how you build data products: agent infrastructure stopped being handwavy, open-weight models started competing where frontier models live, and European regulators decided US cloud access is a policy risk they’re no longer willing to accept.
The first is the most actionable. The second changes your vendor calculus. The third is a slow-moving train you should probably be tracking.
Agent Infrastructure Grows Up
For the past year, “just build an agent” has meant: write a loop, pray for context coherence, restart when it hallucinates. This week, actual patterns emerged.
Anthropic published Effective harnesses for long-running agents—and it’s not another prompt engineering post. The pattern: split initialization from execution. An initializer agent creates scaffolding (init.sh, claude-progress.txt, initial git commit), then a coding agent iterates feature-by-feature with structured updates. Each session writes artifacts the next can recover from. Compaction doesn’t save you. External state does.
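The split can be sketched in a few lines. The artifact names (init.sh, claude-progress.txt) follow the post, but the repo layout and the stubbed harness call are placeholders, not Anthropic's implementation:

```python
import pathlib
import subprocess

def git(repo: pathlib.Path, *args: str) -> None:
    """Run a git command inside the agent's working repo."""
    subprocess.run(["git", "-C", str(repo), *args], check=True)

def initialize(repo: pathlib.Path) -> None:
    """Initializer agent: create scaffolding once, commit it as a baseline."""
    (repo / "init.sh").write_text("#!/bin/sh\n# install deps, start services\n")
    (repo / "claude-progress.txt").write_text("completed features:\n")
    git(repo, "add", "-A")
    git(repo, "commit", "-m", "agent: initial scaffold")

def run_session(repo: pathlib.Path, feature: str) -> None:
    """Coding agent: recover state from disk, do one feature, write state back."""
    progress = (repo / "claude-progress.txt").read_text()
    # ... invoke your LLM harness here, with `progress` as context ...
    (repo / "claude-progress.txt").write_text(progress + f"- {feature}\n")
    git(repo, "commit", "-am", f"agent: {feature}")
```

The point is that recovery never depends on the context window: if a session dies, the next one re-reads claude-progress.txt and the git log instead of a summarized transcript.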
This matches what Beads shipped: a git-backed, graph-based issue tracker designed specifically for multi-agent coordination. Hash-based IDs prevent collisions across branches and clones, and Agent Mail syncs in under 100ms with 98.5% less git traffic. The project exists because sequential issue numbering breaks once multiple agents work in parallel clones.
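Why hash-based IDs matter: two clones minting sequential numbers both create issue #42 and collide on merge, while content-plus-timestamp hashes don't. A toy minting function to illustrate the idea (not Beads' actual scheme):

```python
import hashlib
import time

def mint_id(title: str, author: str) -> str:
    """Derive an issue ID from content and a timestamp instead of a counter.

    Toy illustration: two agents in separate clones get distinct IDs
    without coordinating, so their branches merge cleanly.
    """
    material = f"{title}|{author}|{time.time_ns()}".encode()
    return "bd-" + hashlib.sha256(material).hexdigest()[:8]
```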
Meanwhile, Writing a Good Claude.md crystallizes what stateless agents need from your repo: WHAT/WHY/HOW, not command dumps. Claude may ignore noisy context (the harness injects a system reminder to do so), so keep it minimal and universally relevant.
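In that spirit, a minimal CLAUDE.md might look like the sketch below. The project, commands, and paths are hypothetical, not taken from the post:

```markdown
# CLAUDE.md

## WHAT
Invoicing service: FastAPI + Postgres. One deployable, no microservices.

## WHY
Replaces manual CSV reconciliation; correctness matters more than latency.

## HOW
- `make dev` starts the stack; `make test` must pass before any commit.
- Migrations live in `db/migrations/`; never edit one that has shipped.
```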
Three YC companies—Saturn, Poka Labs, Rocketable—posted founding engineer roles this week. All want the same thing: production LLM agents with explicit state machines, eval flywheels, fault tolerance, and model-agnostic gateways. The job descriptions read like a checklist of what’s missing in most agent codebases.
The converging pattern: external state (git, files, databases), explicit scaffolding, and constrained scope per session. This is infrastructure now, not vibes.
Open-Weight Models Stop Catching Up
Open models used to trail the frontier by 6-12 months. “Good enough for fine-tuning” was the pitch. This week, that framing became obsolete.
Mistral 3 shipped under Apache-2.0: a sparse MoE with 41B active / 675B total parameters, multimodal, multilingual, with NVFP4 checkpoints for vLLM and TensorRT-LLM support. It ranks #2 non-reasoning on LMArena. Ministral 3 (3B/8B/14B) covers the edge. This isn’t a research release—it’s a production-ready family with inference optimization built in.
DeepSeekMath-V2 hit IMO gold-level performance and 118/120 on Putnam 2024. The approach: train a proof verifier, use it as the reward model for the generator, and scale verification compute. Apache-2.0, with weights open for inference.
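The verification-as-reward idea is easiest to see as best-of-n selection. In the toy sketch below a divisibility check stands in for the learned proof verifier; it illustrates the selection loop, not DeepSeek's training recipe:

```python
def verifier_score(candidate: int) -> float:
    """Reward 1.0 when the candidate satisfies the spec (here: divisible by 7)."""
    return 1.0 if candidate % 7 == 0 else 0.0

def best_of_n(candidates):
    """Pick the candidate the verifier scores highest.

    Scaling verification compute means scoring more candidates per problem;
    `max` keeps the first candidate that hits the top score.
    """
    return max(candidates, key=verifier_score)

print(best_of_n(range(1, 100)))  # 7: the first candidate the verifier accepts
```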
Qwen3-VL processes 256k tokens—two-hour videos—with near-perfect “needle” retrieval. Leads on visual math and document OCR. 2B-32B weights on Hugging Face, Apache-2.0.
Apple released STARFlow-V, an open-weights normalizing flow video generator that rivals diffusion quality. T2V/I2V/V2V in one model.
Arcee Trinity Mini: US-trained MoE reasoning model, Apache-2.0, with Trinity Large training on 2048 B300 GPUs for January.
The implication: vendor lock-in arguments are weaker. Hosting costs shift from API margins to inference optimization. If you’re still assuming open models are a fallback, reassess.
Europe Decides US Cloud Is a Policy Risk
This one moves slower, but the direction is clear.
Switzerland’s Privatim issued a resolution: international SaaS is inadmissible for sensitive or legally confidential authority data unless the authority controls client-side encryption keys. The reasons: US CLOUD Act compels disclosure even for Swiss-hosted data, contractual safeguards are insufficient, and provider transparency is too low.
Dutch universities are piloting OpenDesk and Nextcloud after the ICC lost Microsoft email access due to US sanctions. The point isn’t that Microsoft is malicious—it’s that core services can be revoked by policy, not outages.
The EU’s Chat Control 2.0 advances with “voluntary” provider scanning and mandatory age verification. And a CJEU ruling made platforms GDPR controllers for personal data in user posts, exposing them to Article 82 damages even for content removed within an hour.
The pattern: US legal reach is now a classification criterion for European data. Client-side encryption with authority-controlled keys is the new baseline for sensitive workloads. Full migration off O365/Azure/AWS isn’t happening next quarter, but the policy foundation is being laid.
If you serve European clients or handle European data, track this.
The Efficiency Counternarrative
Not a theme, but a recurring tension worth noting.
Pete Warden—who led mobile TensorFlow—wrote “I know we’re in an AI bubble because nobody wants me”. His argument: the industry is overinvesting in GPUs and underinvesting in efficiency engineering. He built Jetpac to run AlexNet inference on hundreds of cheap EC2 CPUs because Caffe’s CPU path was training-oriented, not inference-optimized. Small cross-stack teams can deliver outsized cost savings—but that’s not where the capital goes.
Program-of-Thought prompting beat Chain-of-Thought by ~12% across math and finance datasets by offloading calculation to an external interpreter. Two separate CoT critiques made similar points: language scratchpads are inefficient for algorithmic tasks.
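The mechanics are simple: prompt the model for a program rather than a prose derivation, then execute it. In this sketch the model call is stubbed with a canned program, since the point is the interpreter offload, not the prompt:

```python
def gen_program(question: str) -> str:
    """Stand-in for an LLM prompted to 'write Python that computes the
    answer and assigns it to `answer`'. Stubbed for illustration."""
    return (
        "principal = 1000\n"
        "rate = 0.05\n"
        "years = 3\n"
        "answer = principal * (1 + rate) ** years\n"
    )

def solve(question: str) -> float:
    """Program-of-Thought: the interpreter does the arithmetic, not the model."""
    scope: dict = {}
    exec(gen_program(question), scope)
    return scope["answer"]

print(solve("Compound interest on $1000 at 5% for 3 years?"))
```

The model never multiplies anything in its scratchpad; it only has to produce correct code, which is exactly the failure mode CoT's language-based arithmetic keeps tripping over.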
“Why are your models so big?” argues 15M-parameter models work for narrow tasks like SQL autocomplete—in the browser, at negligible cost.
Gary Marcus called it a trillion dollars potentially wasted, pointing to diminishing scaling returns and the need for neurosymbolic approaches.
Scale isn’t wrong; Gemini 3 and Trainium3 clusters prove it works. But the question isn’t which camp is right. It’s which is right for your workload.
Quick Hits
Accelerator competition heats up. Amazon’s Trainium3 (3nm, >4x perf, NVLink Fusion interop planned) and Google selling TPUs to Anthropic/Meta/neoclouds are compressing Nvidia’s moat. CUDA-L2 used RL to generate kernels that beat cuBLAS. Multi-accelerator stacks are the future—portability matters.
SQLite keeps winning. One author demonstrated 100k TPS over a billion rows on an M1 Pro (WAL mode, tuned PRAGMAs). Another reminded us SQLite makes a good application file format—single-file, ACID, portable, toolable.
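A minimal version of that tuning with Python's stdlib sqlite3. The PRAGMA values are common choices, not the author's exact benchmark configuration:

```python
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
conn.execute("PRAGMA synchronous=NORMAL")  # fsync at checkpoints, not every commit
conn.execute("PRAGMA cache_size=-64000")   # ~64 MB page cache (negative = KiB)
conn.execute(
    "CREATE TABLE IF NOT EXISTS events(id INTEGER PRIMARY KEY, payload TEXT)"
)
# Batch inserts inside one transaction; per-row commits are what kill throughput.
with conn:
    conn.executemany(
        "INSERT INTO events(payload) VALUES (?)",
        [(f"e{i}",) for i in range(10_000)],
    )
print(conn.execute("SELECT count(*) FROM events").fetchone()[0])
```

WAL mode plus batched transactions is most of the distance between SQLite's defaults and the throughput numbers in posts like this one; the cache and synchronous settings are smaller wins on top.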
Security remains brittle. Researchers showed poetic framing bypasses guardrails at a ~62% success rate. A $1B legal AI tool exposed 100k+ files through an unauthenticated API endpoint, with a Box admin token shipped in client-side JavaScript.
AI expands scope, doesn’t replace judgment. Anthropic’s self-study: engineers use AI in ~60% of work, report ~50% productivity gains, but only 0-20% can be fully delegated. 27% of AI-assisted work is net-new—tasks that wouldn’t have been done otherwise. The concern: skill erosion and reduced peer collaboration.
The RAM shortage is real. Memory makers are prioritizing HBM for AI datacenters, cutting consumer lines. DDR4/DDR5 prices are 3-4x. Don’t expect cheap secondhand HBM—it’s integrated.
What to Watch
Agent scaffolding patterns will consolidate. The initializer/executor split, external state, and constrained scope are likely to become standard. Expect frameworks.
Open-weight models will keep closing the gap. Mistral 3 and DeepSeekMath aren’t anomalies—they’re the trend. Evaluate them seriously for production.
European data sovereignty isn’t going away. Swiss and Dutch moves this week are early, but the regulatory direction is clear. Start classifying data by jurisdiction exposure.
The efficiency argument will get louder. Not because scale doesn’t work, but because inference costs recur and most workloads don’t need frontier models.