Q: How do you improve text-to-SQL accuracy on real data?

Ground generation in a governed semantic layer: a typed graph that resolves meaning, proves join paths, and refuses on ambiguity instead of guessing. Compile-time approaches like Colrows lift accuracy above raw LLM generation by fixing the context, not the model.

Analytics & Search · 05 Jul 2026 · Updated 11 Jul 2026 · By Nilesh Kumar

The Enterprise Text-to-SQL Accuracy Benchmark: Every Major Study in One Place

Q: What is the best text-to-SQL benchmark for enterprise use?

Spider 2.0 (Lei et al., ICLR 2025) is the most representative of real enterprise conditions: 632 real-world workflow problems over very large schemas averaging roughly 800 columns per database. On it, an o1-preview code-agent framework solved only about 21.3% of tasks, versus 91.2% on the older Spider 1.0. BEAVER tests real data-warehouse data, where off-the-shelf LLMs scored close to 0% end-to-end.

Q: Why is enterprise text-to-SQL so much harder than benchmarks suggest?

Academic sets like Spider 1.0 use small, clean schemas. Real enterprise databases have hundreds of columns, ambiguous names, undocumented relationships, and governance rules. On that kind of data, published accuracy drops from the 90s to the 20s or lower. The semantic model that grounds the query, not the LLM, is the real bottleneck.

The same model that scores 91% on a textbook benchmark scores about 21% on real enterprise data. That gap is the single most important number in enterprise AI analytics, and it is scattered across four papers. This page collects the major text-to-SQL benchmarks in one cited table and chart, so you can reference the real accuracy picture, not the demo one. Free to cite.

Enterprise text-to-SQL accuracy benchmark: the 91% to 21% cliff.

The accuracy cliff, at a glance

Chart: best published execution accuracy on each benchmark, using strong code-agent frameworks or off-the-shelf LLMs as noted. Higher is better. Academic sets use small, clean schemas; enterprise sets use real, large ones. Sources below.

The benchmark table

Benchmark	Best result	System / model	What it tests	Source
Spider 1.0	91.2%	o1-preview code-agent framework	Cross-domain academic schemas, small and clean	Lei et al., Spider 2.0 paper (arXiv:2411.07763)
BIRD	73.0%	o1-preview code-agent framework	Larger, dirtier databases with values and evidence	Reported in Spider 2.0 paper; BIRD, NeurIPS 2023
Spider 2.0	21.3%	o1-preview code-agent framework	632 real enterprise workflows; schemas averaging ~800 columns	Lei et al., ICLR 2025 (arXiv:2411.07763)
BEAVER	~0% (end-to-end)	Off-the-shelf GPT-4o, Llama3-70B-Instruct	Real private enterprise data-warehouse queries	BEAVER enterprise benchmark
Vendor internal (context)	90%+	Cortex Analyst on a curated semantic model	Snowflake internal 150-question set (not independent)	Snowflake engineering blog

The bottom row frames how to read the rest of the table: with a curated semantic model, Cortex Analyst reaches 90%+ on a Snowflake-internal set, while a single-shot LLM without that model sat at 51% on the same set. That is the whole thesis in one line: the semantic model does the work, not the raw model.

How to read these numbers

Academic benchmarks flatter models. Spider 1.0's schemas are small and clean; real warehouses are not. The drop from 91.2% on Spider 1.0 to 21.3% on Spider 2.0 comes from the same class of system, evaluated on realistic data.
Enterprise schemas are the hard part. Spider 2.0 databases average roughly 800 columns. Ambiguous names, undocumented joins, and governance rules are the real difficulty, not SQL syntax.
Off-the-shelf LLMs collapse on private data. BEAVER found near-0% end-to-end accuracy for GPT-4o and Llama3-70B on real warehouse data.
Curated context is the lever. In the vendor row, the lift from the raw single-shot baseline to 90%+ comes from better grounding (a semantic model), not from a bigger LLM.
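
To make the size of the cliff concrete, the gaps can be worked out directly from the table. The short Python sketch below does only that arithmetic. The function name accuracy_gap is illustrative, and the figures are copied from the table above, not re-measured.

```python
# Best published results copied from the benchmark table (percent).
RESULTS = {
    "Spider 1.0": 91.2,
    "BIRD": 73.0,
    "Spider 2.0": 21.3,
}


def accuracy_gap(baseline: float, target: float) -> tuple[float, float]:
    """Return (drop in percentage points, drop relative to baseline in percent)."""
    absolute = baseline - target
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


for name in ("BIRD", "Spider 2.0"):
    points, relative = accuracy_gap(RESULTS["Spider 1.0"], RESULTS[name])
    print(f"Spider 1.0 -> {name}: -{points} points ({relative}% relative drop)")
```

On these figures, the drop from Spider 1.0 to Spider 2.0 is 69.9 percentage points, which is about 76.6% of the Spider 1.0 score.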

Fix the Context, Not the Model. The benchmarks say it plainly: accuracy tracks the quality of the semantic layer grounding the query, not the size of the model writing it. That is where enterprise text-to-SQL is won or lost.

Closing the gap: the semantic layer

The numbers explain why deterministic, semantic-layer approaches exist. A typed semantic graph resolves meaning, proves the join path, and refuses on ambiguity instead of guessing. Colrows is built on this compile-then-execute model: it emits deterministic, dialect-perfect SQL across 16+ engines with governance enforced before execution. For the narrative version of this data, see the text-to-SQL accuracy cliff; for the tool landscape, see the best text-to-SQL tools; for the architecture, see deterministic vs probabilistic text-to-SQL.
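
As a rough illustration of what "resolves meaning, proves the join path, and refuses on ambiguity" can look like, here is a minimal Python sketch. It is not Colrows's implementation. The tables, columns, the TERMS and JOINS dictionaries, the Refusal exception, and every function name are hypothetical, chosen only to show the pattern: a term must map to exactly one column, a join must follow declared relationships, and anything else raises an error instead of producing a guess.

```python
from collections import deque


class Refusal(Exception):
    """Raised instead of guessing when a request cannot be resolved safely."""


# Hypothetical semantic layer: each business term lists its candidate columns.
TERMS = {
    "revenue": ["orders.net_amount"],
    "region": ["customers.region"],
    "date": ["orders.created_at", "shipments.shipped_at"],  # two meanings
}

# Hypothetical governed joins: only these relationships may be used.
JOINS = {
    ("orders", "customers"): "orders.customer_id = customers.id",
    ("orders", "shipments"): "orders.id = shipments.order_id",
}


def resolve(term: str) -> str:
    """Map a term to exactly one column, or refuse."""
    candidates = TERMS.get(term, [])
    if len(candidates) != 1:
        found = ", ".join(candidates) or "no column"
        raise Refusal(f"'{term}' resolves to {found}; refusing to guess")
    return candidates[0]


def join_path(start: str, goal: str) -> list[tuple[str, str]]:
    """Breadth-first search over declared joins.

    Returns (table_to_join, condition) pairs along a shortest path.
    """
    neighbors: dict[str, list[tuple[str, str]]] = {}
    for (a, b), condition in JOINS.items():
        neighbors.setdefault(a, []).append((b, condition))
        neighbors.setdefault(b, []).append((a, condition))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for nxt, condition in neighbors.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(nxt, condition)]))
    raise Refusal(f"no declared join path from {start} to {goal}")


def compile_query(measure: str, dimension: str) -> str:
    """Build SQL only after both terms and the join path are resolved."""
    measure_col = resolve(measure)
    dimension_col = resolve(dimension)
    base = measure_col.split(".")[0]
    target = dimension_col.split(".")[0]
    joins = "".join(f" JOIN {t} ON {c}" for t, c in join_path(base, target))
    return (f"SELECT {dimension_col}, SUM({measure_col}) "
            f"FROM {base}{joins} GROUP BY {dimension_col}")


print(compile_query("revenue", "region"))
try:
    compile_query("revenue", "date")
except Refusal as err:
    print("refused:", err)
```

In this toy setup, "revenue by region" compiles because both terms resolve to one column each and a declared join connects the two tables. "Revenue by date" is refused, because "date" could mean the order date or the shipment date and nothing in the semantic layer says which one is intended.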

Cite this benchmark

You are welcome to reference this compilation. Suggested citation:

Colrows (2026). The Enterprise Text-to-SQL Accuracy Benchmark. https://colrows.com/blogs/text-to-sql-benchmark/ — compiling Spider 1.0/2.0 (Lei et al., arXiv:2411.07763, ICLR 2025), BIRD (NeurIPS 2023), and BEAVER.

Primary sources: the Spider 2.0 paper (Lei et al., arXiv:2411.07763) reports the Spider 1.0, BIRD, and Spider 2.0 figures for a common o1-preview code-agent framework; the BEAVER benchmark reports the near-0% off-the-shelf result on real warehouse data. Always verify against the primary papers before publishing.

Frequently asked questions

What is the best text-to-SQL benchmark for enterprise use?

Spider 2.0, for its realism: 632 real workflows over schemas averaging ~800 columns. Best published result is about 21.3%. BEAVER tests real warehouse data, where off-the-shelf LLMs score near 0% end-to-end.

Why is enterprise text-to-SQL so much harder than benchmarks suggest?

Academic sets use small, clean schemas; real enterprises have hundreds of columns, ambiguous names, and governance rules. The grounding semantic model, not the LLM, is the bottleneck.

How do you improve accuracy on real data?

Ground generation in a governed semantic layer that resolves meaning, proves joins, and refuses on ambiguity. Fix the context, not the model.

Related reading

The Text-to-SQL Accuracy Cliff: Why Deterministic Compilers Beat LLM Guessing

The Best Text-to-SQL Tools in 2026, Scored on Accuracy, Governance, and Reproducibility

Deterministic vs. Probabilistic Text-to-SQL: Why Accuracy Matters
