Semantic Layer vs Text-to-SQL: Why Deterministic Compilation Wins

Every "chat with your data" project eventually faces the same architecture decision, usually after the demo and before production. Path one: hand the question and the schema to an LLM and let it write SQL - text-to-SQL. Path two: put an explicit model of business meaning between the question and the database, and compile against it - a semantic layer. This piece treats that as the engineering decision it is: accuracy on real data, time to ship, total cost, failure modes - and the honest cases where raw text-to-SQL is exactly the right call.

The two architectures, precisely

Text-to-SQL is generation: prompt in, SQL out. The model receives the question plus whatever context fits - schema DDL, sample rows, maybe retrieved documentation - and produces a query by predicting tokens. Modern versions wrap this in agent loops that explore the schema, execute, and self-correct. The defining property survives every wrapper: the SQL is composed fresh, probabilistically, per request. Ask twice, and you may get two different queries.

A semantic layer is compilation: the question resolves against an explicit, versioned model of meaning - metrics, entities, relationships, join paths, policies - and the SQL is derived from those definitions mechanically. The LLM's job shrinks from "write correct SQL about a business it has never seen" to "map this question onto defined concepts." Same question, same definitions, same SQL, every time. The defining property: correctness lives in reviewable infrastructure, not in per-query model judgment.

These are not two maturity levels of one thing. They are different answers to the question "where should the knowledge of what your data means live?"

The decision, side by side

DimensionText-to-SQLSemantic layer + compilation
Accuracy on real enterprise data10-51% raw in published evaluations (Spider 2.0, BEAVER, Snowflake's internal set)90%+ to 98-100% on well-modeled questions (Snowflake, dbt benchmarks)
RepeatabilityNondeterministic - same question can yield different SQLDeterministic - same question + same graph version = same SQL
Failure modeFluent wrong answer; executes cleanly, looks correctCompilation error - loud, inspectable, safe
Time to first demoDays - this is its genuine superpowerLonger if the layer is hand-authored; comparable if it builds autonomously
Time to trusted productionOpen-ended: per-query evaluation suites, prompt tuning, guardrails, retesting per model releaseBounded by modeling coverage; the layer is testable like software
GovernanceWhatever the database enforces after the query arrivesCompile-time RBAC + ABAC + row/column predicates, before SQL exists
AuditabilityA log of generated queries; no stable definition behind themVersioned definitions, join path proof, point-in-time reproducible trail
Cost shapePer-query inference (agent loops multiply it) + ongoing evaluation labourModeling investment (manual or autonomous) amortized across every query

The accuracy line, briefly

We have covered the benchmark evidence at length elsewhere, so here is only the shape of it. On clean academic schemas, frontier models translate questions to SQL at 86-91%. On benchmarks built from real enterprise warehouses - Spider 2.0, BEAVER - the same models solve 10-21% of tasks, because the decisive context (which revenue column is governed, which join path is approved, what "active" means) is not in the schema. Every published recovery adds explicit structure: Snowflake measured 51% → 90%+ by adding a semantic model; dbt's 2026 benchmark measured 84-90% → 98-100% through deterministic compilation, "because the Semantic Layer's deterministic query generation means the LLM can't produce subtly wrong results." The anatomy is in The Text-to-SQL Accuracy Cliff; the buyer's framework is in Deterministic vs Probabilistic Text-to-SQL.

What this page adds is the engineering reading of those numbers: text-to-SQL's accuracy is a property you must continuously re-earn - per schema change, per prompt tweak, per model upgrade - because nothing in the architecture pins it down. A semantic layer's accuracy is a property you build once per concept and regression-test like any other software artifact.

The honest case for raw text-to-SQL

Vendors selling semantic layers - we are one - tend to skip this section. Text-to-SQL is genuinely the right architecture when a SQL-literate human reviews every query before anyone trusts it:

  • Engineering scratch work. An analyst who can read SQL asking a model to draft a gnarly window function is using the tool exactly right - the human is the verification layer.
  • One-off exploration. Unfamiliar dataset, throwaway questions, no downstream decisions riding on the numbers.
  • Prototypes and demos. Proving the interaction pattern before investing in the meaning layer.
  • Migration and refactoring tooling. Generating candidate translations between dialects, with tests as the arbiter.

Notice the common thread: in every case, the SQL is read. The architecture decision flips at exactly the point where consumers stop reading the SQL - a business user in a chat window, a scheduled report, an AI agent acting on the result. From that point, a fluent wrong answer has no natural predator. That is the production boundary, and it is sharper than most roadmaps admit.

The cost comparison nobody writes down

The demo economics of text-to-SQL are seductive: no modeling, just a prompt. The production economics invert:

  • Inference compounds. Production-grade text-to-SQL is no longer one model call - the systems topping the Spider 2.0 leaderboard run multi-step agent scaffolds that explore, draft, execute, and self-correct across many calls per question. You pay that latency and token bill on every query, forever, to rebuild context the architecture refuses to persist.
  • Evaluation becomes a department. Without deterministic behaviour, the only safety mechanism is empirical: golden-question suites, regression runs per model release, prompt-injection review, accuracy dashboards. Teams budget the LLM and forget the harness around it.
  • The semantic layer's cost is the modeling - which is real, and which is precisely why the question "who builds and maintains it?" decides the economics. Hand-authored layers (LookML, MetricFlow YAML, semantic views) trade the inference bill for an engineering backlog. An autonomously built and maintained graph - Colrows' approach, with drift detection keeping it current - takes the modeling line item out of both columns.

Where Colrows sits in this choice

Colrows is the compilation path, implemented end to end: a semantic graph - versioned, typed, multi-scope - built autonomously across the estate, and a compile-then-execute pipeline (intent → context resolution → constrained planning → governed execution) that emits dialect-perfect SQL with compile-time governance and join path proof. The LLM interprets intent; it never freehands the SQL. Failure is a compilation error with an explanation, not a confident guess - and every answer carries an audit trail that reproduces point-in-time.

Text-to-SQL asks the model to be right. A semantic layer makes it unnecessary for the model to be right. Stop retrieving. Start executing.

Frequently asked questions

What is the difference between a semantic layer and text-to-SQL?

Text-to-SQL generates SQL probabilistically from the question and schema context - fresh each time. A semantic layer resolves the question against explicit, versioned business definitions and compiles the SQL from them deterministically. The difference is where correctness lives: in per-query model judgment, or in reviewable infrastructure.

Is text-to-SQL accurate enough for production?

On real enterprise schemas, raw generation measures 10-51% in published evaluations, and it fails fluently rather than loudly. It is fine where a SQL-literate human reviews every query; it is the wrong architecture for business-user self-service, regulated reporting, or AI agents.

Why does a semantic layer improve accuracy?

Because the errors come from missing meaning, not weak models. Making the meaning explicit - and constraining generation through it - is the only intervention with consistent published gains: 16.7% → 54.2% (knowledge graph, Sequeda et al.), 51% → 90%+ (Snowflake's semantic model), 98-100% (dbt's deterministic layer).

When is raw text-to-SQL the right choice?

Engineer-supervised work: scratch queries, exploration, prototypes, migration tooling - anywhere the SQL gets read before it gets trusted. The boundary is the moment consumers stop reading the SQL.

A note on the numbers

All accuracy figures are the cited sources' published claims as of 12 June 2026, linked inline; several evaluation sets are small and vendor-internal - treat them as directional evidence, which is how we use them. This page is reviewed quarterly.

Don't ask the model to be right. Make it unnecessary. Compile the question.