Building for Resilience
The Data Report: Weekly State of the Market in Data Product Building | Week ending January 18, 2026
This week, the community talked about what doesn’t break.
DuckDB keeps winning converts because it installs in seconds and runs without dependencies. A founder weighing MotherDuck isn’t chasing features; they’re chasing reliability. A data engineer leaves Microsoft Fabric not for something newer, but for something that works. Meanwhile, two separate discussions pushed the same message: AI doesn’t fix your data problems. It amplifies them. And the teams building production LLM pipelines are learning that structured outputs require engineering discipline, not optimism.
The thread running through it all: resilience. Not the buzzword kind. The kind where your pipeline runs without you babysitting it. Where your models mean what you think they mean. Where your LLM returns valid JSON instead of creative interpretations.
Four themes this week: foundations that make AI possible, local compute that just works, structured outputs that don’t fail, and the growing pains of a platform that promised everything.
Foundations Before AI
The semantic layer conversation has been building for years. AtScale’s 2025 Semantic Layer Summit surfaced a striking data point: LLMs were wrong 80% of the time without semantic guidance, but achieved near-perfect accuracy when grounded in a semantic layer. Gartner called semantic technologies “foundational” for AI success. SiliconANGLE’s January 2026 outlook put it simply: “2025 was about building agents. 2026 is about trusting them.”
This week, two discussions pushed the same message. One argued that data modeling isn’t dead; it’s more relevant than ever because multimodal AI increases the need to model structured, semi-structured, and unstructured data. You can’t point an LLM at a Kafka stream and expect a reliable warehouse. The other made the case that AI on top of a broken data stack is useless. LLMs increase the blast radius of bad data. Fragmented definitions, inconsistent metrics, and brittle pipelines don’t improve when AI sits on top of them; they fail faster and at larger scale.
The community response was pragmatic. Many cited broken lineage and misaligned metrics as the cost of skipping modeling. The advice: invest in clean models, consistent metrics, and the right early hire before expecting value from GenAI.
What this tells us: The AI hype cycle is meeting data reality. Teams are learning that LLMs need well-modeled data, not magic wands.
Practitioner action: Adopt. Before investing in AI features, audit your data foundations. Semantic layers and dimensional models matter more now, not less.
The DuckDB Ascent
DuckDB’s trajectory is no longer speculative. Analysis of 1.8 million Hacker News headlines showed 50.7% year-over-year growth in developer interest. DB-Engines ranks it around #51, up from #81 a year ago. Amazon’s internal data suggests that 94% of query spending goes to computation that doesn’t need distributed compute. The “SQLite of analytics” label is sticking because it’s accurate: single-binary, zero dependencies, pip-installable, and fast.
This week, Robin Linacre’s post made the case for DuckDB as a default local analytics engine. It reads Parquet, CSV, and JSON from disk, S3, or HTTP. The SQL is rich (EXCLUDE, COLUMNS, QUALIFY, window aggregate modifiers). For CI testing and rapid iteration, it’s hard to beat.
Meanwhile, a founder asked whether building on MotherDuck is a mistake. Their stack (DLT to GCS to MotherDuck, dbt running in MotherDuck) works. The concern: ecosystem gaps, especially around ML and BI tooling. The community response was supportive: use what works today, decouple for portability, revisit as scale evolves.
What this tells us: DuckDB is graduating from “interesting project” to default choice for local analytics. MotherDuck extends that into SaaS territory for teams who want simplicity without self-managing.
Practitioner action: Try. If you’re reaching for pandas or Spark for local analytics, DuckDB deserves evaluation.
LLM-Data Integration Patterns
Getting LLMs to produce reliable structured outputs has become a core data engineering skill. A 2024 Gartner survey found that 75% of AI projects fail due to integration issues, often from inconsistent responses. The problem: prompts that work in testing fail after model updates, JSON parsers break on unexpected types, and field names mutate without warning.
The Structured Outputs Handbook surfaced on Hacker News this week. It covers the landscape: JSON mode, function calling, constrained decoding, validation libraries. The key insight: OpenAI’s structured outputs with constrained sampling score 100% on complex JSON schema following, compared to under 40% for older approaches. JSON schema enforcement can reduce parsing errors by up to 90%.
The discussion was practical. Structured outputs boost agent reliability, but teams should run evaluations and mix unconstrained generation with constrained retries when needed.
This connects to a broader pattern: LLMs are moving into ETL pipelines that run without human intervention. When an LLM generates transformation logic or extracts entities, schema control isn’t optional. Tools like Pydantic AI are emerging to address exactly this: structured outputs and schema validation as first-class concerns.
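The validate-and-retry loop at the heart of these tools is simple enough to sketch directly. The version below uses plain Pydantic (not Pydantic AI); `call_llm`, the `Invoice` schema, and the canned replies are all hypothetical stand-ins for a real model call.

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    """Target schema the LLM output must satisfy."""
    vendor: str
    total_cents: int

def parse_with_retry(call_llm, max_retries: int = 2) -> Invoice:
    """call_llm(feedback) -> raw JSON string; feedback carries the
    previous validation error so the retry prompt can correct it."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = call_llm(feedback)
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError as exc:
            feedback = str(exc)  # fed back into the next prompt
    raise RuntimeError("LLM never produced a valid Invoice")

# Stub standing in for a real model: the first reply drifts the field
# name ("vendor_name" instead of "vendor"), the retry fixes it.
replies = iter([
    '{"vendor_name": "Acme", "total_cents": 1999}',
    '{"vendor": "Acme", "total_cents": 1999}',
])
invoice = parse_with_retry(lambda feedback: next(replies))
```

The design point is that validation failures become prompt feedback rather than pipeline crashes, which is the discipline the handbook discussion calls for.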
What this tells us: LLM integration is maturing from “prompt and pray” to engineering discipline.
Practitioner action: Try. If you’re building LLM-powered data extraction or transformation, learn the structured output patterns. Pydantic AI, Instructor, and native provider features are worth evaluating.
Microsoft Fabric’s Growing Pains
Microsoft Fabric criticism isn’t new. Brent Ozar’s May 2025 post called it “just plain unreliable,” noting that the status page showed green even during 12-hour outages. Fabric still has no SLA and offers no refunds for downtime. Redditors have resorted to reporting outages to third-party trackers like StatusGator.
This week, a solo data engineer detailed why they’re leaving Fabric. The complaints: random pipeline hangs with poor error messages, slow SQL Server ingestion, and shared capacity that pits ETL spikes against Power BI refreshes. The verdict: Fabric works for some, but the on-prem hybrid use case remains painful.
The community response was mixed but tilted negative. Some defend Fabric when using mirroring, capacity isolation, and Azure Data Factory for ingestion. But the consensus was clear: for teams with on-prem SQL Server and limited capacity budgets, simpler alternatives (DuckDB, Databricks, Snowflake, even just PostgreSQL) offer more predictable results. One commenter compared Fabric to “a 5-month-old baby” versus Databricks and Snowflake as “almost teenagers.”
What this tells us: Microsoft’s unified platform bet is hitting friction in the mid-market. The promise doesn’t match reality for hybrid/on-prem scenarios.
Practitioner action: Watch. If evaluating Fabric for hybrid or on-prem scenarios, the community’s experiences suggest careful capacity planning and realistic expectations about SQL Server ingestion.
The Thread
Resilience isn’t a feature you add later. It’s a choice you make from the start.
This week’s discussions had a common thread: practitioners choosing tools and practices that don’t break under pressure. Data modeling that gives AI something solid to work with. Local compute that runs without clusters or dependencies. Schema enforcement that prevents LLM outputs from going sideways. And the hard-won knowledge that a platform’s marketing doesn’t always match its operational reality.
The market is still moving fast. New tools launch weekly. AI capabilities expand monthly. But the teams building data products that last are the ones asking: will this still work when things go wrong? The boring answer is usually the resilient one.


