The Text-to-SQL Accuracy Cliff: 91% on Benchmarks, 21% in Production

The same model. The same task. A 70-point drop. On the academic benchmark every vendor quotes, frontier models turn questions into SQL with 86-91% success. On benchmarks built from real enterprise databases, the same models solve 10-21% of tasks. That gap is not noise, and it is not fixed by the next model release. It has a structure - and once you see it, you will read every "our AI is 95% accurate" slide differently.

What the famous numbers actually measure

For years, the standard text-to-SQL benchmark was Spider 1.0: thousands of question-SQL pairs over small, clean, well-named databases - a college enrolment schema here, a concert-singer database there. A handful of tables. Obvious joins. Column names that mean what they say. On that terrain, modern LLMs are genuinely excellent: 86.6% for GPT-4o, 91.2% for o1-preview-based agents.

Those are the numbers that made "just ask your data" demos credible, and they are real. The problem is what they are numbers about. A Spider 1.0 task is a translation exercise: the question contains nearly everything needed to write the query, and the schema is small enough to read whole. Production enterprise data is not a translation exercise. It is an interpretation exercise against a thousand-column estate with twenty years of accumulated naming decisions.

What happens on real enterprise data

Three benchmark programs have now measured that second problem directly. They agree with each other to an uncomfortable degree.

  • Spider 2.0 (Lei et al., ICLR 2025) rebuilt the benchmark from 632 real enterprise workflow tasks - BigQuery and Snowflake estates that often exceed 1,000 columns, solutions that can require multi-step SQL beyond 100 lines. The authors' o1-preview-based agent framework solved 21.3%, versus 91.2% on Spider 1.0. The project site puts GPT-4o at 10.1%, versus 86.6% on Spider 1.0. Same models, honest data, an 8.5× collapse.
  • BEAVER (an MIT CSAIL-group benchmark, 2026) went further: 9,128 question-SQL pairs harvested from real private enterprise warehouse query logs across 812 tables. State-of-the-art agentic frameworks on GPT-5.2 achieve 10.8%. Handed oracle hints for every subtask, they reach 30.1%. On data that was never on the public internet, the cliff is steeper still.
  • BIRD, the long-running benchmark of large, "dirty" databases, shows the gap from the other side: the best published system scores around 82% execution accuracy while human data engineers score 92.96% - a double-digit gap that has survived every model generation since the leaderboard opened.
Bar chart showing the text-to-SQL accuracy cliff: GPT-4o at 86.6% and o1-preview agents at 91.2% on the academic Spider 1.0 benchmark, collapsing to 10.1% and 21.3% on enterprise Spider 2.0, with BEAVER's GPT-5.2 agentic result at 10.8% on real warehouse logs.
The cliff, in the benchmarks' own numbers. Sources: spider2-sql.github.io, arXiv:2411.07763, arXiv:2409.02038, as of June 2026.

The three gaps that create the cliff

Why does the same model lose 70 points crossing from benchmark to warehouse? Three specific things change.

1. Scale breaks the reading strategy

On a 30-column schema, the model reads everything and reasons over all of it. On a 1,400-column estate, the schema does not fit usefully in context - and even when it technically fits, relevance gets diluted. The model must guess which subset of the schema matters before it writes a line of SQL, and a wrong guess at that step is unrecoverable. This is why Spider 2.0's hardest failures are not syntax errors; they are queries built on the wrong tables.

2. The decisive semantics are not in the database

Which of the four columns containing the word "revenue" is the one finance signs off on? Is customer in the billing schema the same entity as account in CRM? Does "last quarter" mean calendar or fiscal? No schema encodes these answers. They live in documentation, in BI definitions, in the heads of analysts - outside anything the model can see. The Sequeda, Allemang and Jacob study isolated this variable: GPT-4 zero-shot on an enterprise insurance schema scored 16.7%, and on the high-schema-complexity questions, 0%. The same questions over a knowledge-graph representation of the same database: 54.2%. Nothing about the model changed. The available meaning did.

3. Nothing verifies the answer

A generated query that parses and executes looks exactly like a correct one. There is no ground truth at runtime, no error message for "this join silently double-counts subscriptions." Snowflake's engineering team measured GPT-4o at 51% on their internal set of 150 real BI questions - which means roughly every second answer was wrong, fluently. dbt's 2026 benchmark named this failure mode precisely: "Text-to-SQL will cheerfully give you a wrong number."

What provably closes the gap

The published record contains exactly one pattern that recovers the lost accuracy, demonstrated independently at least four times: make the missing context explicit, and constrain generation through it.

StudyWithout structureWith structureWhat was added
Sequeda et al. (2023)16.7%54.2%Knowledge-graph representation
Snowflake (2024)51%90%+Hand-authored semantic model
dbt Labs (2026)84-90%98-100%Deterministic semantic layer (11-question set)
Spider 2.0 leaderboard (2026)10-21% raw70-96% self-reportedMulti-step agentic scaffolds with schema exploration and self-correction

Read the fourth row carefully, because it is the one vendors will quote next. The systems now topping the Spider 2.0 leaderboard are not better raw models - they are elaborate scaffolds that explore the schema, retrieve context, execute, and self-correct across many model calls per question. They close the gap by rebuilding the missing structure on the fly, per query, every time. It works, at the cost of latency, tokens, and zero guarantees of consistency between runs. The alternative is to build that structure once - as a governed, versioned semantic layer - and compile every query through it. Same insight, durable form.

How to read a vendor's accuracy claim

Once you know the cliff exists, accuracy claims become checkable. Four questions strip the marketing off any number:

  • Accuracy on whose data? Spider 1.0-class benchmarks predict almost nothing about a thousand-column warehouse. If the claim is not measured on enterprise-scale schemas - or better, your schemas - it is a translation-exercise score.
  • How many questions, how many runs? dbt's benchmark is honest about being 11 questions; Snowflake's is 150. Small evaluation sets are fine if disclosed - undisclosed, they are a red flag.
  • Single-shot or scaffolded? If the demo uses an agent loop with retries and self-correction, ask what the per-question cost and latency are, and whether two runs give the same answer.
  • What does failure look like? The most important question. A system that errors when it cannot answer is operationally safe at 90%. A system that fabricates when it cannot answer is operationally dangerous at 99%.

That last distinction - loud failure versus fluent failure - is the dividing line between probabilistic and deterministic architectures, and it deserves its own treatment: the full buyer's framework is in Deterministic vs Probabilistic Text-to-SQL.

Where Colrows sits

Colrows is built on the premise the benchmarks keep confirming: the accuracy is in the context, so the context should be infrastructure. The semantic graph - versioned, typed, multi-scope - captures the meaning the cliff studies found missing: governed metric definitions, proven join paths, entity identity across systems, policies. Every question compiles through it: intent → context resolution → constrained planning → governed execution, with compile-time governance and dialect-perfect SQL out the other side. And because the graph is built and maintained autonomously with drift detection, the structure that closes the cliff does not become a second modeling backlog.

The cliff is not a model problem wearing a data costume. It is a context problem wearing a model costume. Fix the context. Not the model.

Frequently asked questions

Why do text-to-SQL models score high on benchmarks but fail in production?

Because the benchmarks and production are different problems. Academic sets like Spider 1.0 use small, clean, well-named schemas where the question contains nearly everything needed. Real estates exceed a thousand columns, with ambiguous names, multiple plausible join paths, and decisive context that lives outside the database. Spider 2.0, built from real workflows, measured the same models at 10-21% that scored 86-91% on Spider 1.0.

What is the Spider 2.0 benchmark?

A 632-task text-to-SQL benchmark (Lei et al., ICLR 2025) derived from real enterprise workflows on BigQuery, Snowflake, and other platforms, with databases that often exceed 1,000 columns and solutions requiring multi-step SQL past 100 lines. Reported results: 21.3% for the authors' o1-preview agent framework (versus 91.2% on Spider 1.0); 10.1% for GPT-4o (versus 86.6%).

How accurate is natural language to SQL on enterprise data?

Raw generation: roughly 10-51% in published evaluations (Spider 2.0, BEAVER, Snowflake's internal set). With an explicit semantic model: 90%+ (Snowflake Cortex Analyst) to 98-100% on well-modeled questions (dbt, 2026). The spread is the whole story - accuracy tracks the structure around the model, not the model.

How can I improve text-to-SQL accuracy?

Add explicit structure: every published gain comes from a knowledge graph (16.7% → 54.2%), a semantic model (51% → 90%+), or a deterministic semantic layer (98-100%). The durable version is a governed semantic layer that queries compile through - not per-query prompt engineering rebuilt on every request.

A note on the numbers

Every figure is the cited source's own published claim, linked inline, as of 11 June 2026. Benchmarks are dataset-specific and several evaluation sets are small (dbt: 11 questions; Snowflake: 150); leaderboard entries are vendor self-reported. Treat the figures as evidence of a directional pattern, not universal constants. This page is reviewed quarterly.

Your warehouse is not a benchmark. Compile against what your data actually means.