Research·05 Jul 2026 · Updated 16 Jul 2026·By Colrows Research·All posts

The Efficacy of Autonomous Semantic Execution: A First-Party Research Compendium

Q: How accurate is text-to-SQL on real enterprise data?

On sanitized academic benchmarks (Spider 1.0) frontier models reach 86-91% accuracy, but on realistic enterprise data they collapse. On Spider 2.0 the o1-preview agent fell to 21.3% and GPT-4o to 10.1%; on the MIT BEAVER benchmark (private warehouse query logs) GPT-4o scored 0.0% with retrieval and 2.0% without. In the Colrows first-party test across 50 proprietary and EMR schemas, raw frontier models averaged 14.5% execution accuracy.

Q: What did the Colrows first-party benchmark measure and find?

Colrows ran identical multi-step business and clinical queries through raw LLM text-to-SQL generation and through its deterministic semantic compiler, across 50 proprietary enterprise schemas plus the open-source MIMIC-III and eICU medical databases. Raw generation scored 14.5% execution accuracy; the Colrows compiler scored 98.2%. On complex multi-table aggregations, raw models had an 80% join-path hallucination rate while the compiler held 100% join-path accuracy, with semantic resolution of 0.91 MRR@1 at under 80 ms p99 latency.

Q: How much does poor data quality and semantic decay cost enterprises?

In a Colrows survey of 50 data leaders and related enterprise metrics, poor data quality and semantic inconsistency cost organizations an average of $12.9 million annually. The typical enterprise also absorbs roughly 400 data incidents and 2,400 hours of downtime per year (over $2.6 million), and BI teams in dashboard-sprawl environments spend up to 80% of their hours maintaining existing assets rather than building new ones.

Q: Why does compiled semantic execution beat raw text-to-SQL?

Accuracy tracks the semantic structure around the model, not the size of the model. Raw generation must guess join paths and metric definitions that do not live in the physical schema. Colrows resolves intent against a versioned, typed semantic graph, proves join paths at compile time, injects governance before execution, and emits dialect-perfect SQL. This also removes the up-to-15% run-to-run variance that probabilistic models exhibit even at temperature zero, which matters for audit and reproducibility.

Q: Are these results reproducible?

Yes. The benchmark uses publicly available EMR schemas (MIMIC-III and eICU via the EHRSQL benchmark) alongside proprietary enterprise schemas, and compiled semantic execution is deterministic: identical queries compile to identical SQL against a versioned semantic definition, so a result can be re-run and reproduced with the definitions in force at the time.

Most claims about enterprise AI cite someone else's numbers. This is first-party research. It combines a practitioner survey of 50 data leaders, a Colrows-run benchmark that put raw text-to-SQL and compiled semantic execution through the exact same queries, and two named-customer case studies. Every figure below is measured, not borrowed.

Key findings

14.5% → 98.2%

Execution accuracy: raw LLM vs. Colrows compiler, same queries

100%

Join-path accuracy vs. an 80% hallucination rate for raw models

0.91

Semantic resolution (MRR@1) at under 80 ms p99 latency

$12.9M

Average annual cost of poor data quality (survey of 50 leaders)

What is in this compendium, and how was it measured?

The research is organized around three pillars. First, a practitioner assessment: Colrows surveyed 50 enterprise data leaders about their semantic-layer maintenance costs, read alongside broader enterprise metrics. Second, a benchmark: we ran raw frontier-model text-to-SQL against compiled semantic execution over 50 proprietary enterprise schemas plus the open-source MIMIC-III and eICU medical databases (from the EHRSQL benchmark), sending identical multi-step queries through each path. Third, two named-customer case studies (Cipla and SSP Group) with before-and-after operational metrics. The point is a citable baseline, not a summary of other people's benchmarks.

What does manual semantic maintenance actually cost?

Data quality is usually misdiagnosed as a pipeline problem. The real crisis is semantic decay: the quiet divergence of meaning as definitions drift, legacy SQL gets copied and modified without version control, and the same word means different things to finance and growth. Nothing crashes, so observability tools never fire. Yet the tax is severe. Across our survey and related enterprise metrics, poor data quality and semantic inconsistency cost organizations an average of $12.9 million annually, and BI teams in dashboard-sprawl environments spend up to 80% of their hours maintaining existing assets instead of building new ones.

Trying to fix this with more headcount does not work, because documentation is static while meaning is dynamic. By the time a data dictionary ships, reality has moved. Data stewards ($79,000-$108,000), analytics engineers ($130,000-$145,000), and senior data engineers ($170,000+) end up policing static wikis that never catch up.

Cost center	Metric / financial impact	Primary cause
Overall data quality	$12.9M annual cost	Inconsistent metrics, lack of trust, validation debates
Operational inefficiency	~400 incidents / 2,400 hours downtime ($2.6M+)	Silent schema drift and semantic decay
BI resource allocation	Up to 80% of BI working hours	Fixing downstream ripple effects of schema changes
Regulatory compliance (SOX)	$1M - $2M annually	~70% administrative overhead from manual lineage
Manual labor (per head)	$79,000 - $170,000+ per engineer or steward	Policing static documentation

This is increasingly a regulatory problem, not just an operational one. Under BCBS 239, banks must aggregate risk data automatically with attribute-level lineage; Citigroup was fined $135.6 million in July 2024 for data-quality-management deficiencies. The EU AI Act (high-risk enforcement from 2 August 2026) demands data lineage, algorithmic traceability, and output reproducibility that probabilistic LLMs generating SQL cannot reliably deliver. We cover the drift mechanics in depth in Knowledge Drift and Semantic Decay.

How far does model accuracy fall on real enterprise schemas?

The story that language models can democratize raw database access rests on sanitized benchmarks. Spider 1.0 used small, clean databases with obvious joins, and frontier models did well: GPT-4o reached 86.6% and an o1-preview agent framework 91.2%. Spider 2.0 (Lei et al., ICLR 2025) rebuilt the test with 632 real enterprise tasks over BigQuery and Snowflake, databases that routinely exceed 1,000 columns and demand 100+ line SQL. Accuracy collapsed. The MIT BEAVER benchmark, built entirely from private warehouse query logs that never touched the public internet, is even harsher, because models cannot lean on memorized schemas.

Benchmark	Database complexity / origin	Model	Execution accuracy
Spider 1.0	Sanitized, small schemas, obvious joins	o1-preview agent	91.2% (baseline)
Spider 2.0	1,000+ columns, BigQuery/Snowflake, codebase context	o1-preview agent	21.3% (-69.9 pts)
Spider 1.0	Sanitized, small schemas, obvious joins	GPT-4o	86.6% (baseline)
Spider 2.0	1,000+ columns, BigQuery/Snowflake, codebase context	GPT-4o	10.1% (-76.5 pts)
BEAVER (retrieval)	Private enterprise warehouse query logs	GPT-4o	0.0%
BEAVER (no retrieval)	Private enterprise warehouse query logs	GPT-4o	2.0%

The full analysis of why this happens is in The Text-to-SQL Accuracy Cliff.

The Colrows first-party benchmark: what happens when the same queries are compiled?

External benchmarks expose the failure mode. To establish our own citable baseline, we ran raw generation against compiled semantic execution over the same test suite: 50 proprietary enterprise schemas plus the open-source MIMIC-III and eICU medical databases, which are notoriously hard because they require parsing hospital terminology, time-sensitive variables, and survival-rate calculations. Then we routed the identical queries through the Colrows deterministic compiler, where intent is resolved against a versioned semantic graph before any SQL runs.

Test	Database complexity / origin	Execution method	Result
First-party accuracy	50 proprietary schemas + EMR (MIMIC-III, eICU)	Raw LLM text-to-SQL	14.5% execution accuracy
First-party accuracy	50 proprietary schemas + EMR (MIMIC-III, eICU)	Colrows semantic compiler	98.2% execution accuracy
Join-path accuracy	Complex multi-table aggregations	Raw LLM text-to-SQL	~20% (80% hallucination rate)
Join-path accuracy	Complex multi-table aggregations	Colrows semantic compiler	100%
Semantic resolution	Benchmark enterprise datasets	Colrows semantic compiler	0.91 MRR@1, under 80 ms p99 latency

Raw frontier models averaged 14.5%, failing to find correct join paths and hallucinating metric definitions. The same queries through Colrows reached 98.2%. The mechanism matters more than the headline: business semantics (does "revenue" include deferred subscriptions or exclude local taxes?) do not live in the physical schema, so a model that only reads the schema is guessing. Compiling against a typed graph removes the guess. It also removes the up-to-15% run-to-run variance that probabilistic models show even at temperature zero, which is what makes the output auditable. The methodology and downloadable framing are in the text-to-SQL benchmark, and the architecture is explained in What Is a Semantic Compiler?

Core principle

Fix the context, not the model. Accuracy tracks the semantic structure surrounding the model, not the scale of the model. A governed semantic graph that resolves meaning at compile time produces more reliable enterprise AI than a larger probabilistic model ever will.

Case study: how did Cipla federate its data without a migration?

Cipla, a global pharmaceutical enterprise, runs a commercial apparatus of over 22,500 field representatives who need hyper-local, real-time intelligence on prescribing patterns, campaign efficacy, and inventory. The problem was fragmentation: CRM in Cirrius, financials in Oracle, logistics in a legacy ERP, with core entities (doctors, brands, campaigns, regions) existing as isolated tables with no shared identity. Every question became an IT ticket, campaign diagnosis took weeks, and a 30% localized stockout rate constrained revenue.

Rather than a multi-year centralization project, Colrows deployed its semantic execution layer over a Trino federated query engine, building a unified knowledge graph across the systems without moving data. Row-level scope was injected at compile time so each of the 22,500 reps saw only their territory.

Operational dimension	Before (manual BI)	After (compiled semantics)
Data truth & reporting	Multiple systems, multiple versions of the truth	One semantic layer, one consistent metric definition
Decision latency	Campaign underperformance surfaced in days	Underperforming channels flagged 1000× faster
Supply chain	30% stockout rate constraining supply	30% reduction in stockouts; 15% better inventory turnover
Engineering burden	Every new question is an IT ticket	Business teams self-serve; IT freed for higher-value work

Read the full Cipla case study.

Case study: how did SSP Group push decisions to the edge?

SSP Group operates 3,000+ food-and-beverage units across nearly 40 countries with roughly 49,000 staff. Engineering was fragmented across DBeaver, Squirrel, and Toad with no version control; the infrastructure team maintained cumbersome bastion hosts; a custom in-house reporting system had become unmaintainable; and 49,000 frontline staff had no self-serve access. Colrows replaced the fragmented client ecosystem with one governed workspace (SQL plus embedded Python notebooks), dismantled the bastion hosts, and put a conversational interface in Slack so store managers could ask questions in plain language, compiled deterministically against the semantic graph.

Operational dimension	Before (fragmented legacy)	After (unified semantic platform)
Developer tooling & collaboration	A different tool per developer (DBeaver, Squirrel, Toad)	80% improvement in cross-team collaboration
Front-line data access	Front-line staff had no self-serve access	3× faster issue resolution via natural language
Infrastructure overhead	Bastion infrastructure piled on overhead	40% reduction in data-management overhead
Technical debt	Custom in-house system could not scale	Custom internal system retired entirely

Read the full SSP Group case study.

Why do the other semantic layers not clear this bar?

The market has several generations of semantic layer, each optimized for a different consumer. Presentation-time semantics (Looker/LookML) lock logic inside a BI tool. Metric stores (dbt Semantic Layer, Cube) decouple metrics from the dashboard but standardize arithmetic for human consumption and rely on manual authoring. OLAP-governance layers (AtScale) accelerate cube queries but require upfront modeling. All of them assume a human analyst on the other end. Colrows operates at a different altitude: it standardizes systemic meaning (entities, relational paths, security policies) into an autonomously built graph, and resolves it at compile time into deterministic, auditable SQL for machine agents. See the full platform comparison. For the AI-agent era specifically, read our comparative analysis of Colrows, ThoughtSpot, Timbr.ai, and Databricks Genie.

Frequently asked questions

How accurate is text-to-SQL on real enterprise data?

On sanitized academic benchmarks frontier models reach 86-91%, but on realistic enterprise data they collapse: Spider 2.0 dropped the o1-preview agent to 21.3% and GPT-4o to 10.1%, and on MIT's BEAVER (private query logs) GPT-4o scored 0.0% with retrieval and 2.0% without. In the Colrows first-party test, raw models averaged 14.5%.

What did the Colrows first-party benchmark measure and find?

Identical multi-step business and clinical queries were routed through raw LLM generation and through the Colrows compiler across 50 proprietary schemas plus MIMIC-III and eICU. Raw generation scored 14.5%; the compiler scored 98.2%, with 100% join-path accuracy (versus an 80% hallucination rate for raw models) and 0.91 MRR@1 semantic resolution at under 80 ms p99 latency.

How much does poor data quality and semantic decay cost enterprises?

An average of $12.9 million annually, per our survey of 50 data leaders and related enterprise metrics. The typical enterprise also absorbs about 400 data incidents and 2,400 hours of downtime a year (over $2.6 million), and BI teams can spend up to 80% of their time on maintenance.

Why does compiled semantic execution beat raw text-to-SQL?

Because accuracy tracks the semantic structure around the model, not its size. Colrows resolves intent against a versioned typed graph, proves join paths, injects governance before execution, and emits deterministic SQL, removing both the guesswork and the run-to-run variance that break audit.

Are these results reproducible?

Yes. The benchmark uses public EMR schemas (MIMIC-III and eICU) alongside proprietary ones, and compiled execution is deterministic: the same query compiles to the same SQL against a versioned definition, so results can be re-run with the definitions in force at the time.