Your dbt Just Got a New Engine (and a New Owner)

The Data Product Report: Weekly State of the Market in Data Product Building | Week ending June 8, 2026

Jun 09, 2026

This Week

You probably have a dbt project that takes too long to parse. You have been waiting for the engine to get faster, and this week it did: dbt Labs announced a Rust rewrite that claims up to 10x faster parse times, a merger with Fivetran that consolidates ingestion and transformation under one roof, and a feature called dbt State that skips unchanged models so your CI pipeline stops rebuilding the world on every commit. That alone would make this a consequential week. But the same few days also surfaced a quieter lesson: the assumptions baked into your stack have a way of going stale without anyone noticing. A tokenizer swap silently inflated large language model (LLM) costs for production analytics interfaces. A Databricks benchmark showed that the Hive-era partitioning most lakehouse teams still run is optimizing for a pruning mechanism their engine does not use. The common thread is not that everything broke. It is that nothing looked broken until someone measured.

Your dbt Just Got a New Engine (and a New Owner)

If you run a dbt project with more than a few hundred models, you know the feeling: you change one file, hit build, and wait while the parser walks every model in the project before it gets to yours. That parse time is the tax you pay on every CI run, every local iteration, every “let me just check this one thing.” It is the thing that makes large dbt projects feel heavy.

This week dbt Labs announced dbt Core v2.0 at Snowflake Summit, and the headline is a Rust-based engine called Fusion that rewrites the parser from scratch. The claimed improvement is up to 10x faster parse times. The alpha is installable now (pip install dbt==2.0.0-preview.x) with adapters for Snowflake, BigQuery, Databricks, and Redshift. A proprietary build adds column lineage and instant feedback, but the Rust engine itself is open source.

The second announcement is the one that changes the vendor landscape: dbt Labs is merging with Fivetran. Ingestion and transformation, two layers that most analytics teams run as separate tools with separate contracts, will consolidate under one roof. For teams already running both, the merger is a procurement question and a roadmap question. For teams running alternatives on either side, the merger is a competitive signal about where bundling is heading.

The third announcement is quieter and possibly more useful day-to-day: dbt State, a feature that tracks which models changed since the last run and skips the rest. If your CI pipeline rebuilds everything because it cannot tell what changed, dbt State is a direct answer. The compute savings are proportional to how much of your project is unchanged on a given commit, which for most teams is most of it.
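To make the idea of a change-aware build concrete, here is a minimal sketch of the general technique: fingerprint each model, compare against the last run, and rebuild only what changed plus everything downstream of it. This is an illustration of the concept, not how dbt State is implemented. The model names, the SQL, and the helper functions are all hypothetical.

```python
import hashlib
from collections import deque


def fingerprint(sql: str) -> str:
    """Hash a model's SQL so that any edit changes its fingerprint."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def models_to_build(current_sql, previous_fingerprints, children):
    """Return the changed models plus everything downstream of them."""
    changed = {
        name
        for name, sql in current_sql.items()
        if previous_fingerprints.get(name) != fingerprint(sql)
    }
    to_build = set(changed)
    queue = deque(changed)
    # Breadth-first walk down the dependency graph from each changed model.
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in to_build:
                to_build.add(child)
                queue.append(child)
    return to_build


# Hypothetical project: stg_orders -> orders -> revenue, plus an unrelated model.
previous = {
    "stg_orders": "select * from raw.orders",
    "orders": "select id, amount from stg_orders",
    "revenue": "select sum(amount) from orders",
    "customers": "select * from raw.customers",
}
previous_fingerprints = {name: fingerprint(sql) for name, sql in previous.items()}

# One commit touches a single model.
current = dict(previous, orders="select id, amount, currency from stg_orders")
children = {"stg_orders": ["orders"], "orders": ["revenue"]}

selected = models_to_build(current, previous_fingerprints, children)
skipped = set(current) - selected
print(sorted(selected))  # ['orders', 'revenue']
print(sorted(skipped))   # ['customers', 'stg_orders']
```

In this toy project, one edited model triggers two builds and skips the other two. On a real project with hundreds of models, the skipped share is where the compute savings would come from.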

Meanwhile, dltHub won Snowflake’s Partner of the Year and shipped a Snowflake Native App that replicates Microsoft SQL Server, Oracle, MySQL, and Postgres entirely in-account, with no external orchestrator. The stat that caught the eye: 91% of new dlt pipelines in January 2026 were agent-built. The open-source ingestion layer is accelerating with or without the Fivetran merger, and the question of who owns the ingestion-to-transformation path just got more competitive.

The practical test this week is straightforward. Install the v2.0 alpha on a branch, run your existing project through it, and compare parse times against your current setup. If your full build takes 10 minutes or more, the parse-time improvement should be measurable. Then try dbt State on a commit that touches one model and see how much of your CI pipeline it skips. Those two numbers tell you whether the migration is worth planning now or watching for a cycle.
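If you want the parse-time comparison to be more than a stopwatch guess, a small timing harness helps. The sketch below times any command several times and reports the median. The command shown is a do-nothing placeholder so the example runs anywhere. In practice you would substitute whatever invokes the parser in each of your two environments (for example, a `dbt parse` call from each virtual environment, which is an assumption about your setup and not something the announcement specifies).

```python
import statistics
import subprocess
import sys
import time


def median_seconds(command, runs=3):
    """Run a command several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Placeholder command so the sketch is runnable as-is. Replace each list with
# the real parse command for your current engine and for the alpha.
placeholder = [sys.executable, "-c", "pass"]

baseline = median_seconds(placeholder)
candidate = median_seconds(placeholder)
print(f"baseline {baseline:.3f}s, candidate {candidate:.3f}s, "
      f"speedup {speedup(baseline, candidate):.2f}x")
```

Taking the median of a few runs keeps one cold cache or one noisy CI runner from deciding the result.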

The bottom line: The teams that installed the dbt v2.0 alpha this week got a parse-time number they can hold up against their current CI pipeline and a change-aware build that skips what didn’t move. The Fivetran merger changes the vendor map, but the engine is what changes Tuesday.

The Bill Went Up and Nobody Changed the Code

You have probably noticed that managing LLM costs in production feels less like budgeting and more like chasing a number that moves when you are not looking. If you are running a text-to-SQL interface, a governed natural-language query layer, or any agent that talks to your warehouse through an AI model, the cost question you care about is: what does it cost to finish a task? Not what does a token cost.

This week, the gap between those two questions became concrete. Anthropic’s Claude Opus 4.7 switched tokenizers, and independent tests show the same prompts now consume roughly 32 to 45 percent more tokens for text (up to 3x for images). The price per token did not change. The number of tokens per prompt did. After caching, net costs rose 12 to 27 percent. No code change, no opt-out, light disclosure.

The mechanism is what makes this worth understanding, not just noting. A tokenizer is the component that splits your prompt into the units the model charges for. When a vendor ships a new tokenizer, the same English sentence can become more or fewer tokens. Anthropic’s new tokenizer produces more. The price list says the rate is the same; the meter runs faster.
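A toy example shows how the meter can run faster at an unchanged rate. The two tokenizers below are deliberately simplistic and resemble no vendor's real tokenizer: one splits on whitespace, the other also chops every word into chunks of at most four characters. The prompt and the price per million tokens are hypothetical.

```python
def coarse_tokenize(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


def fine_tokenize(text, width=4):
    """Toy tokenizer: split each word into chunks of at most `width` characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + width] for i in range(0, len(word), width))
    return tokens


def prompt_cost(token_count, price_per_million):
    """Dollars charged for a prompt at a fixed per-token rate."""
    return token_count * price_per_million / 1_000_000


PRICE_PER_MILLION = 5.00  # hypothetical rate, identical for both tokenizers
prompt = "Show monthly revenue by region for the last quarter"

old_count = len(coarse_tokenize(prompt))
new_count = len(fine_tokenize(prompt))
increase = new_count / old_count - 1

print(old_count)          # 9
print(new_count)          # 13
print(f"{increase:.0%}")  # 44%
```

The rate constant never changed, yet the same sentence costs about 44 percent more, because cost is token count times rate and only the count moved.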

Community reaction was blunt: frustration about hidden cost increases, and a growing consensus that cost-per-token is a vendor-controlled number that tells you almost nothing about what you are actually spending. The call is for cost-per-finished-task benchmarking, where you measure what it costs to answer a question, generate a report, or complete a workflow, regardless of how many tokens the model consumed along the way.

For data teams, this is not abstract. If you wired a semantic layer to an LLM for natural-language queries, or you ship a Slack bot that runs SQL on behalf of analysts, your cost basis just shifted. The queries are the same. The answers are the same. The bill is higher. And because the change is in the tokenizer, not the API contract, your monitoring probably did not catch it unless you were already tracking token counts per task.

The teams that caught this early share a pattern: they pin model versions in production so a tokenizer swap does not silently change their economics. They track tokens consumed per finished task, not just per API call. And they re-benchmark cost-per-task across model versions before upgrading, so the cost change is a decision they made, not a surprise they absorbed.
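Here is one way that pattern could look in code: log every API call with its task and pinned model version, then divide total spend by the number of tasks that actually finished. This is a sketch under assumptions. The record fields, the version labels, and the prices are hypothetical, and your provider's usage fields may be named differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    task_id: str
    model_version: str  # the pinned version string you deploy with
    input_tokens: int
    output_tokens: int


def cost_per_finished_task(calls, finished_task_ids, input_price, output_price):
    """Average dollars per finished task, grouped by model version.

    Prices are per million tokens. Spend on calls for tasks that never
    finished still counts, since you paid for those tokens too.
    """
    spend = defaultdict(float)
    finished = defaultdict(set)
    for call in calls:
        spend[call.model_version] += (
            call.input_tokens * input_price + call.output_tokens * output_price
        ) / 1_000_000
        if call.task_id in finished_task_ids:
            finished[call.model_version].add(call.task_id)
    return {
        version: spend[version] / len(finished[version])
        for version in spend
        if finished[version]
    }


# Hypothetical log: the same two-call task replayed on two pinned versions.
calls = [
    Call("task-a", "model-v1", 1000, 200),
    Call("task-a", "model-v1", 500, 100),
    Call("task-b", "model-v2", 1400, 200),
    Call("task-b", "model-v2", 700, 100),
]
per_task = cost_per_finished_task(
    calls, finished_task_ids={"task-a", "task-b"}, input_price=5.0, output_price=25.0
)
change = per_task["model-v2"] / per_task["model-v1"] - 1
print(f"cost per finished task changed by {change:.0%}")
```

Grouping by pinned version is the design choice that matters: it turns a tokenizer change into a visible difference between two rows instead of a slow drift in one.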

The bottom line: The teams that were already tracking cost-per-finished-task noticed the Opus 4.7 tokenizer shift in their dashboards this week and pinned their version before the bill landed. The ones tracking cost-per-token saw the same rate card and missed the 12 to 27 percent increase hiding underneath it.

Your Partitioning Scheme Is Lying to You

If you set up a Delta Lake table more than a year or two ago, you probably partitioned it by date. Maybe by date and region. It is the Hive pattern, and it is what most teams default to because it is what they learned, what the documentation used to recommend, and what the table already has. The question worth asking this week: does the partitioning actually do what you think it does?

Databricks published a benchmark-backed argument that for Delta Lake and Iceberg tables, Liquid Clustering outperforms Hive-style partitioning on nearly every axis that matters: 35% lower clustering time, 22% faster queries, changeable keys, automatic handling of both low and high cardinality columns, no small-file problems, and lower write amplification.

The insight underneath the benchmarks is more interesting than the numbers. On Delta and Iceberg, pruning does not work the way most teams think it does. Hive partitioning relies on directory-level pruning: the query planner reads directory names to skip irrelevant partitions. But Delta and Iceberg do not prune by directory. They prune by reading transaction logs and per-column statistics at file granularity. The directory structure is cosmetic. The engine is already doing file-level pruning whether your table is partitioned or not.

That means the Hive-style partition scheme is not helping the query planner. It is creating small files when cardinality is high, preventing key changes when your query patterns shift, and adding write amplification on every insert. Liquid Clustering replaces all of this with changeable clustering keys that the engine optimizes automatically, sorting data within files by the clustering columns and letting the file-level stats do the pruning work.
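The mechanism is easy to sketch. Each data file carries minimum and maximum values per column, and the planner skips any file whose range cannot overlap the query's filter. The code below is a simplified illustration of that idea, not the Delta or Iceberg implementation. The file names and date ranges are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileStats:
    path: str
    min_date: date  # smallest value of the filter column in this file
    max_date: date  # largest value of the filter column in this file


def files_to_scan(files, start, end):
    """Keep only files whose [min, max] range overlaps the query range."""
    return [f.path for f in files if f.min_date <= end and f.max_date >= start]


# Unclustered layout: every file holds rows from across the whole quarter.
unclustered = [
    FileStats(f"unclustered-{i}.parquet", date(2026, 1, 1), date(2026, 3, 31))
    for i in range(3)
]

# Clustered layout: rows are sorted by date, so each file covers a narrow range.
clustered = [
    FileStats("clustered-0.parquet", date(2026, 1, 1), date(2026, 1, 31)),
    FileStats("clustered-1.parquet", date(2026, 2, 1), date(2026, 2, 28)),
    FileStats("clustered-2.parquet", date(2026, 3, 1), date(2026, 3, 31)),
]

query = (date(2026, 2, 10), date(2026, 2, 20))
print(files_to_scan(unclustered, *query))
# ['unclustered-0.parquet', 'unclustered-1.parquet', 'unclustered-2.parquet']
print(files_to_scan(clustered, *query))
# ['clustered-1.parquet']
```

Nothing in the pruning function looks at a directory name. What decides how many files get skipped is how tightly each file's value range is packed, which is the job clustering does.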

The honest caveat: these are Databricks’ own benchmarks on their own platform. Independent validation has not arrived yet, and the improvement numbers will vary by workload. But the underlying mechanism (file-level stats, not directory pruning) is verifiable on your own tables.

The test: pick your three highest-cost partitioned Delta tables. Run your typical analytical queries against them as-is, then convert one to Liquid Clustering and run the same queries. Compare scan times, file counts, and write amplification. If the numbers move in the direction the benchmarks suggest, you have a concrete case for migrating your defaults. If they do not, you have data to explain why your workload is different.
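Once you have both sets of measurements, a small helper keeps the comparison honest by reporting every metric the same way. The metric names and numbers below are made up for illustration. Substitute whatever your own before and after runs produce.

```python
def percent_change(before, after):
    """Percent change per metric; negative means the value went down."""
    return {
        metric: (after[metric] - before[metric]) / before[metric] * 100
        for metric in before
    }


# Hypothetical measurements from the same queries on the same table.
partitioned = {"scan_seconds": 40.0, "file_count": 1200}
liquid_clustered = {"scan_seconds": 30.0, "file_count": 300}

report = percent_change(partitioned, liquid_clustered)
for metric, change in report.items():
    print(f"{metric}: {change:+.1f}%")
# scan_seconds: -25.0%
# file_count: -75.0%
```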

The bottom line: The teams that ran the benchmark on their own Delta tables this week found out whether their partitioning scheme is doing real work or just creating small files. The ones still running Hive-style partitions by default are maintaining a layout strategy designed for a pruning mechanism their engine does not use.

The Radar

If you care about governed AI analytics:

Snowflake Summit turned the semantic layer into a multi-vendor platform feature in about 48 hours. AtScale and Snowflake launched Semantic Views for XMLA (XML for Analysis) Endpoints, exposing warehouse semantics directly to Excel and Power BI with a one-command setup. ThoughtSpot expanded its Cortex AI integration with bi-directional semantic sync. And OpenAI models are now available inside Snowflake Cortex AI. Three vendors shipping production semantic-layer integrations in one week is trend evidence, not a coincidence. If you are building a governed natural-language query interface, the platform options just multiplied.

If you’re building pipelines:

Dagster published a community showcase featuring a vertical data stack built on Polars, DuckDB, and DuckLake, with Dagster orchestrating. Worth a look if you are exploring what the non-cloud-warehouse analytical stack looks like when everything runs locally. And dltHub published a practical guide to schema evolution covering Avro, Protobuf, and versioned contracts for data pipelines. Useful when your agent-built pipelines start shipping schemas you did not design.

If you’re running Databricks:

Query Tags landed, adding model-level cost attribution to your warehouse queries. If you are benchmarking the cost impact of migrating table layouts (see above), this is how you measure it. Cross-Engine Attribute-Based Access Control (ABAC) via Unity Catalog lets you write one policy that governs Spark, Trino, Flink, and DuckDB reads. And Spark Real-Time Mode shipped with a transformWithState operator for sessionization, closing a gap that previously required Flink.

If you care about data quality tooling:

dltHub’s AI Workbench preview adds schema-aware data quality checks to agent-built pipelines. If you adopted dltHub Transformation and are wondering how to validate what the agent produced, this is the quality layer arriving alongside it.

If you’re evaluating dev tools:

MotherDuck shipped an Obsidian plugin that runs DuckDB queries directly inside your notes. Niche, but if your team uses Obsidian for documentation and you want SQL next to your runbooks, it is there.

What did your dbt parse time look like on the v2.0 alpha? Reply and tell us whether the Rust engine moved the needle on your project.

The Data Product Report is published every Tuesday by RepublicOfData.io.
