Two ways to turn a question into SQL
Every natural-language analytics product on the market - every copilot, every chat-with-your-data feature, every AI analyst - converts a question into a database query. There are exactly two architectures for doing it.
Probabilistic text-to-SQL hands the question, plus whatever schema context fits in the prompt, to a large language model and asks it to write SQL. The model produces the most likely query, token by token. It is fast to build and demos beautifully. Its defining property is that the output is a sample from a distribution: the same question can produce different SQL on different runs, and nothing in the architecture distinguishes a correct query from a fluent one.
Deterministic text-to-SQL puts a compilation step between intent and execution. The question is resolved against an explicit semantic model - entities, metrics, relationships, join paths, policies - and the SQL is constructed from that model rather than guessed. The defining property is the inverse: the same question, in the same scope, against the same model version, always produces the same SQL. And when the question cannot be resolved - an unknown term, an unprovable join, an unauthorised column - the system fails loudly with an error instead of returning a plausible wrong answer.
In Colrows terms, the deterministic pipeline is: intent → context resolution → constrained planning → governed execution. The LLM still does what LLMs are good at - understanding what the human meant - but the query itself is compiled through the semantic graph, with join path proof and compile-time governance, and emitted as dialect-perfect SQL.
The accuracy cliff is measured, not theoretical
The case for determinism does not rest on vendor claims. It rests on a remarkably consistent published record. All figures below are the cited sources' own numbers, current as of June 2026.
On real enterprise data, raw generation collapses
Spider 2.0 (Lei et al., ICLR 2025) replaced the small, clean academic schemas of earlier benchmarks with 632 tasks drawn from real enterprise workflows - databases that routinely exceed 1,000 columns, queries that run past 100 lines. The result, in the authors' words: their o1-preview-based agent framework solved 21.3% of Spider 2.0 tasks, against 91.2% on the original Spider 1.0. For GPT-4o the project reports 10.1% on Spider 2.0 versus 86.6% on Spider 1.0. Same models. The benchmark got honest; the accuracy fell off a cliff.
The 2026 follow-up is harsher. BEAVER, an MIT CSAIL-group benchmark built from real private enterprise warehouse query logs (9,128 question-SQL pairs across 812 tables), reports that state-of-the-art agentic frameworks using GPT-5.2 achieve 10.8% accuracy - rising only to 30.1% even when the system is handed oracle hints. And on BIRD, the long-running benchmark of "big and dirty" databases, the best published system sits around 82% execution accuracy while human data engineers score 92.96% - a gap that has persisted for years of model releases. (We dissect these benchmarks - what they measure and why the scores collapse - in The Text-to-SQL Accuracy Cliff.)
Adding structure is what recovers the accuracy
The same literature shows exactly one reliable fix, and it is not a bigger model.
- Knowledge graphs. Sequeda, Allemang and Jacob tested GPT-4 on enterprise insurance-schema questions: zero-shot SQL accuracy was 16.7%, rising to 54.2% when the same questions were answered over a knowledge-graph representation of the same database. On the high-schema-complexity questions, zero-shot SQL scored 0%.
- Semantic models. Snowflake's engineering team reported that GPT-4o single-shot generation scored about 51% on their internal set of 150 real-world BI questions, while Cortex Analyst - the same class of model wrapped in a hand-authored semantic model - reached 90%+. (In coverage of the launch, VentureBeat reported Snowflake's comparison placing dedicated text-to-SQL tools, including Databricks' Genie, around 79% - Snowflake's claim about a competitor, so read it as such.)
- Deterministic compilation. dbt Labs' April 2026 benchmark (11 insurance-domain questions, 20 runs each) measured frontier models at 84-90% on raw text-to-SQL and 98-100% through their deterministic semantic layer, "because the Semantic Layer's deterministic query generation means the LLM can't produce subtly wrong results." Their summary is the whole argument in two sentences: "With text-to-SQL, failure looks like a plausible but incorrect answer. With the Semantic Layer, failure looks like an error message."
Even the vendors shipping probabilistic systems say this in their documentation. Microsoft's Power BI Copilot docs state plainly that Copilot "can produce inaccurate or low-quality outputs, including incorrect answers to data questions," and that the underlying model "is nondeterministic and isn't guaranteed to produce a correct answer, or the same answer with the same prompt, model, and data." That is not a criticism of Microsoft - it is unusually honest labelling of what probabilistic generation is.
What about the leaderboard climbers?
By mid-2026, the public Spider 2.0 leaderboard shows agentic systems self-reporting scores in the 70-96% range. Look at what those entries are: multi-step agent scaffolds that explore the schema, retrieve context, draft, execute, and self-correct - sometimes across dozens of model calls per question. The gap from 10% to 90% was closed by adding structure and verification around the model, never by the raw model alone. Which is the thesis: fix the context, not the model. The only open question is whether that structure is rebuilt ad hoc inside every agent run, or modelled once, governed, versioned, and compiled against.
Why a better model doesn't fix it
The intuition "GPT-6 will solve this" fails for a structural reason: the missing information is not in the training data, and most of it is not in the prompt either. Which of the three revenue columns is the governed one. Whether customer in finance means the same entity as account in sales. Which join path through the billing tables the data team considers correct. Which rows this specific user is allowed to see, in this specific jurisdiction, today. These are facts about your enterprise, not about SQL. A model can only guess at them - fluently, confidently, and differently on each run.
That is why every published accuracy gain in this space comes from the same move: make the enterprise context explicit and force the query through it. The knowledge graph in the Sequeda study, the semantic model in Cortex Analyst, MetricFlow in dbt's benchmark, the agentic scaffolds atop Spider 2.0 - they are all versions of one idea, executed at different depths. The semantic layer is that idea made into permanent, governed infrastructure instead of a per-query improvisation.
The buyer's framework: seven questions
If you are evaluating natural-language analytics - a copilot, an AI analyst, a semantic layer, or a semantic execution layer - these seven questions separate the architectures faster than any feature matrix.
- Ask the same question twice. Is the SQL byte-identical? If not, the system is sampling, not compiling - and every downstream number inherits that variance. This single test classifies most products immediately.
- What happens when it cannot answer? The deterministic failure mode is an error: unknown term, unprovable join, unauthorised scope. The probabilistic failure mode is a confident wrong number. Ask the vendor to demonstrate a failure, not a success.
- Who proves the joins? A multi-table answer is only as correct as its join path. Demand join path proof: the system should show that the path from entity to entity was validated against a typed model, not assembled by the LLM from column-name similarity.
- Where is governance enforced? Compile-time governance - RBAC, ABAC, and row/column-level predicates injected before SQL is generated - means unauthorised data is never read. Governance applied after generation, or worse, after retrieval, means your audit story depends on filters nobody can prove ran. Compile-time governance. Not after-the-fact.
- Can you read the SQL? Auditable SQL is the difference between an answer you can defend to a regulator and an answer you have to take on faith. Every answer should ship with the exact query that produced it.
- Who maintains the semantic model? This is where deterministic systems differ from each other. Hand-authored semantic models (YAML metric specs, BI worksheets) buy determinism at the price of a standing curation team - and the model drifts the day the schema changes. Evaluate autonomous maintenance: graph construction from the warehouse, catalogs, and documentation, with drift detection proposing updates instead of tickets.
- Can you reproduce last quarter's answer? Definitions change. A versioned semantic model with a point-in-time reproducible audit trail can replay any answer under the definitions in force when it was given. If the vendor cannot replay history, every number you have ever shipped is unverifiable.
When is probabilistic generation acceptable?
Honesty cuts both ways: probabilistic text-to-SQL is not useless. The framework is about matching architecture to stakes.
| Situation | Probabilistic generation | Deterministic compilation |
|---|---|---|
| Analyst exploring data, reads SQL before trusting it | Fine - it is a drafting accelerator | Also works, with guarantees |
| Self-service analytics for non-technical users | Risky - users cannot detect subtle wrongness | Required - users inherit the system's guarantees |
| Numbers that reach boards, customers, or filings | Unacceptable failure mode | Required - reproducible and auditable |
| Regulated domains (BFSI, healthcare, pharma) | Audit cannot rely on sampled output | Required - compile-time governance + audit trail |
| Autonomous AI agents querying at machine speed | Errors compound silently across steps | Required - proven joins, loud failures, full traceability |
The pattern in the right-hand column: the moment a human stops reviewing every query - because the consumer is a business user, a board pack, or an agent - determinism stops being a preference and becomes the control.
Where Colrows sits
Colrows is a semantic execution layer: the deterministic architecture, taken to its conclusion. The semantic graph is versioned, typed, and multi-scope (global → datastore → persona → user), built and maintained autonomously with drift detection rather than hand-authored YAML. Every question - from a human in chat or an agent over HTTP, JDBC, or MCP - runs the same compile-then-execute pipeline: meaning resolved via multi-vector embeddings, joins proven against the graph, RBAC, ABAC, and row/column predicates injected at compile time, dialect-perfect SQL emitted for the target engine, and a point-in-time reproducible audit trail attached to the answer. Ask it the same question twice and you get the same SQL. Ask it something it cannot prove, and it tells you - before a single byte is read.
Stop asking which model writes better SQL. Start asking which architecture makes wrong SQL impossible to ship silently. Fix the context. Not the model.
Frequently asked questions
What is deterministic text-to-SQL?
An architecture in which natural-language intent is compiled through an explicit semantic model - entities, metrics, join paths, policies - rather than generated token-by-token by a language model. The same question, in the same scope, against the same model version, always produces the same SQL; questions that cannot be resolved fail with an error instead of a plausible wrong query.
Why is text-to-SQL inaccurate on enterprise data?
Because the decisive context lives outside the database: which column is the governed metric, which join path is approved, which rows a user may see. Benchmarks built from real enterprise warehouses make this visible - Spider 2.0 and BEAVER measure raw frontier-model success at roughly 10-21%, versus 86-91% on the older clean academic benchmarks.
How accurate is text-to-SQL in 2026?
It tracks the amount of structure around the model, not the model generation. Published figures: 10-51% for raw generation on real enterprise workloads; 90%+ with a hand-authored semantic model (Snowflake Cortex Analyst); 98-100% on well-modeled questions through a deterministic semantic layer (dbt, 2026). All figures are the cited vendors' and authors' own evaluations.
Does a better LLM fix text-to-SQL accuracy?
The evidence says no. BEAVER measured a state-of-the-art agentic framework on GPT-5.2 at 10.8% on real enterprise data; Snowflake measured GPT-4o at 51% and got to 90%+ by adding a semantic model, not a better model; dbt found the deterministic layer, not model choice, drove the gain. Every published path to high accuracy adds structure around the model.
What is the difference between a semantic layer and text-to-SQL?
Text-to-SQL is a capability: turn a question into a query. A semantic layer is infrastructure: an explicit, governed model of what the data means. A semantic execution layer compiles intent through that model into governed, dialect-perfect SQL with join path proof and compile-time governance - so the generation step is constrained instead of free-form. The full comparison is in RAG vs Semantic Layer and the pillar guide.
A note on the numbers
Every figure on this page is the cited source's own published claim - benchmark papers, vendor engineering blogs, and official documentation - linked inline, as of 11 June 2026. Benchmarks are dataset-specific (dbt's is 11 questions; Snowflake's is 150 internal questions; leaderboard entries are self-reported): treat them as evidence of a directional pattern, not universal constants. This page is reviewed quarterly; if a number has moved, the source links are the ground truth.
