The two architectures, precisely
Text-to-SQL is generation: prompt in, SQL out. The model receives the question plus whatever context fits - schema DDL, sample rows, maybe retrieved documentation - and produces a query by predicting tokens. Modern versions wrap this in agent loops that explore the schema, execute, and self-correct. The defining property survives every wrapper: the SQL is composed fresh, probabilistically, per request. Ask twice, and you may get two different queries.
A semantic layer is compilation: the question resolves against an explicit, versioned model of meaning - metrics, entities, relationships, join paths, policies - and the SQL is derived from those definitions mechanically. The LLM's job shrinks from "write correct SQL about a business it has never seen" to "map this question onto defined concepts." Same question, same definitions, same SQL, every time. The defining property: correctness lives in reviewable infrastructure, not in per-query model judgment.
These are not two maturity levels of one thing. They are different answers to the question "where should the knowledge of what your data means live?"
The decision, side by side
| Dimension | Text-to-SQL | Semantic layer + compilation |
|---|---|---|
| Accuracy on real enterprise data | 10-51% raw in published evaluations (Spider 2.0, BEAVER, Snowflake's internal set) | 90%+ to 98-100% on well-modeled questions (Snowflake, dbt benchmarks) |
| Repeatability | Nondeterministic - same question can yield different SQL | Deterministic - same question + same graph version = same SQL |
| Failure mode | Fluent wrong answer; executes cleanly, looks correct | Compilation error - loud, inspectable, safe |
| Time to first demo | Days - this is its genuine superpower | Longer if the layer is hand-authored; comparable if it builds autonomously |
| Time to trusted production | Open-ended: per-query evaluation suites, prompt tuning, guardrails, retesting per model release | Bounded by modeling coverage; the layer is testable like software |
| Governance | Whatever the database enforces after the query arrives | Compile-time RBAC + ABAC + row/column predicates, before SQL exists |
| Auditability | A log of generated queries; no stable definition behind them | Versioned definitions, join path proof, point-in-time reproducible trail |
| Cost shape | Per-query inference (agent loops multiply it) + ongoing evaluation labour | Modeling investment (manual or autonomous) amortized across every query |
The accuracy line, briefly
We have covered the benchmark evidence at length elsewhere, so here is only the shape of it. On clean academic schemas, frontier models translate questions to SQL at 86-91%. On benchmarks built from real enterprise warehouses - Spider 2.0, BEAVER - the same models solve 10-21% of tasks, because the decisive context (which revenue column is governed, which join path is approved, what "active" means) is not in the schema. Every published recovery adds explicit structure: Snowflake measured 51% → 90%+ by adding a semantic model; dbt's 2026 benchmark measured 84-90% → 98-100% through deterministic compilation, "because the Semantic Layer's deterministic query generation means the LLM can't produce subtly wrong results." The anatomy is in The Text-to-SQL Accuracy Cliff; the buyer's framework is in Deterministic vs Probabilistic Text-to-SQL.
What this page adds is the engineering reading of those numbers: text-to-SQL's accuracy is a property you must continuously re-earn - per schema change, per prompt tweak, per model upgrade - because nothing in the architecture pins it down. A semantic layer's accuracy is a property you build once per concept and regression-test like any other software artifact.
The honest case for raw text-to-SQL
Vendors selling semantic layers - we are one - tend to skip this section. Text-to-SQL is genuinely the right architecture when a SQL-literate human reviews every query before anyone trusts it:
- Engineering scratch work. An analyst who can read SQL asking a model to draft a gnarly window function is using the tool exactly right - the human is the verification layer.
- One-off exploration. Unfamiliar dataset, throwaway questions, no downstream decisions riding on the numbers.
- Prototypes and demos. Proving the interaction pattern before investing in the meaning layer.
- Migration and refactoring tooling. Generating candidate translations between dialects, with tests as the arbiter.
Notice the common thread: in every case, the SQL is read. The architecture decision flips at exactly the point where consumers stop reading the SQL - a business user in a chat window, a scheduled report, an AI agent acting on the result. From that point, a fluent wrong answer has no natural predator. That is the production boundary, and it is sharper than most roadmaps admit.
The cost comparison nobody writes down
The demo economics of text-to-SQL are seductive: no modeling, just a prompt. The production economics invert:
- Inference compounds. Production-grade text-to-SQL is no longer one model call - the systems topping the Spider 2.0 leaderboard run multi-step agent scaffolds that explore, draft, execute, and self-correct across many calls per question. You pay that latency and token bill on every query, forever, to rebuild context the architecture refuses to persist.
- Evaluation becomes a department. Without deterministic behaviour, the only safety mechanism is empirical: golden-question suites, regression runs per model release, prompt-injection review, accuracy dashboards. Teams budget the LLM and forget the harness around it.
- The semantic layer's cost is the modeling - which is real, and which is precisely why the question "who builds and maintains it?" decides the economics. Hand-authored layers (LookML, MetricFlow YAML, semantic views) trade the inference bill for an engineering backlog. An autonomously built and maintained graph - Colrows' approach, with drift detection keeping it current - takes the modeling line item out of both columns.
Where Colrows sits in this choice
Colrows is the compilation path, implemented end to end: a semantic graph - versioned, typed, multi-scope - built autonomously across the estate, and a compile-then-execute pipeline (intent → context resolution → constrained planning → governed execution) that emits dialect-perfect SQL with compile-time governance and join path proof. The LLM interprets intent; it never freehands the SQL. Failure is a compilation error with an explanation, not a confident guess - and every answer carries an audit trail that reproduces point-in-time.
Text-to-SQL asks the model to be right. A semantic layer makes it unnecessary for the model to be right. Stop retrieving. Start executing.
Frequently asked questions
What is the difference between a semantic layer and text-to-SQL?
Text-to-SQL generates SQL probabilistically from the question and schema context - fresh each time. A semantic layer resolves the question against explicit, versioned business definitions and compiles the SQL from them deterministically. The difference is where correctness lives: in per-query model judgment, or in reviewable infrastructure.
Is text-to-SQL accurate enough for production?
On real enterprise schemas, raw generation measures 10-51% in published evaluations, and it fails fluently rather than loudly. It is fine where a SQL-literate human reviews every query; it is the wrong architecture for business-user self-service, regulated reporting, or AI agents.
Why does a semantic layer improve accuracy?
Because the errors come from missing meaning, not weak models. Making the meaning explicit - and constraining generation through it - is the only intervention with consistent published gains: 16.7% → 54.2% (knowledge graph, Sequeda et al.), 51% → 90%+ (Snowflake's semantic model), 98-100% (dbt's deterministic layer).
When is raw text-to-SQL the right choice?
Engineer-supervised work: scratch queries, exploration, prototypes, migration tooling - anywhere the SQL gets read before it gets trusted. The boundary is the moment consumers stop reading the SQL.
A note on the numbers
All accuracy figures are the cited sources' published claims as of 12 June 2026, linked inline; several evaluation sets are small and vendor-internal - treat them as directional evidence, which is how we use them. This page is reviewed quarterly.
