What is in this compendium, and how was it measured?
The research is organized around three pillars. First, a practitioner assessment: Colrows surveyed 50 enterprise data leaders about their semantic-layer maintenance costs, read alongside broader enterprise metrics. Second, a benchmark: we ran raw frontier-model text-to-SQL against compiled semantic execution over 50 proprietary enterprise schemas plus the open-source MIMIC-III and eICU medical databases (from the EHRSQL benchmark), sending identical multi-step queries through each path. Third, two named-customer case studies (Cipla and SSP Group) with before-and-after operational metrics. The point is a citable baseline, not a summary of other people's benchmarks.
What does manual semantic maintenance actually cost?
Data quality is usually misdiagnosed as a pipeline problem. The real crisis is semantic decay: the quiet divergence of meaning as definitions drift, legacy SQL gets copied and modified without version control, and the same word means different things to finance and growth. Nothing crashes, so observability tools never fire. Yet the tax is severe. Across our survey and related enterprise metrics, poor data quality and semantic inconsistency cost organizations an average of $12.9 million annually, and BI teams in dashboard-sprawl environments spend up to 80% of their hours maintaining existing assets instead of building new ones.
Trying to fix this with more headcount does not work, because documentation is static while meaning is dynamic. By the time a data dictionary ships, reality has moved. Data stewards ($79,000-$108,000), analytics engineers ($130,000-$145,000), and senior data engineers ($170,000+) end up policing static wikis that never catch up.
| Cost center | Metric / financial impact | Primary cause |
|---|---|---|
| Overall data quality | $12.9M annual cost | Inconsistent metrics, lack of trust, validation debates |
| Operational inefficiency | ~400 incidents / 2,400 hours downtime ($2.6M+) | Silent schema drift and semantic decay |
| BI resource allocation | Up to 80% of BI working hours | Fixing downstream ripple effects of schema changes |
| Regulatory compliance (SOX) | $1M - $2M annually | ~70% administrative overhead from manual lineage |
| Manual labor (per head) | $79,000 - $170,000+ per engineer or steward | Policing static documentation |
This is increasingly a regulatory problem, not just an operational one. Under BCBS 239, banks must aggregate risk data automatically with attribute-level lineage; Citigroup was fined $135.6 million in July 2024 for data-quality-management deficiencies. The EU AI Act (high-risk enforcement from 2 August 2026) demands data lineage, algorithmic traceability, and output reproducibility that probabilistic LLMs generating SQL cannot reliably deliver. We cover the drift mechanics in depth in Knowledge Drift and Semantic Decay.
How far does model accuracy fall on real enterprise schemas?
The story that language models can democratize raw database access rests on sanitized benchmarks. Spider 1.0 used small, clean databases with obvious joins, and frontier models did well: GPT-4o reached 86.6% and an o1-preview agent framework 91.2%. Spider 2.0 (Lei et al., ICLR 2025) rebuilt the test with 632 real enterprise tasks over BigQuery and Snowflake, databases that routinely exceed 1,000 columns and demand 100+ line SQL. Accuracy collapsed. The MIT BEAVER benchmark, built entirely from private warehouse query logs that never touched the public internet, is even harsher, because models cannot lean on memorized schemas.
| Benchmark | Database complexity / origin | Model | Execution accuracy |
|---|---|---|---|
| Spider 1.0 | Sanitized, small schemas, obvious joins | o1-preview agent | 91.2% (baseline) |
| Spider 2.0 | 1,000+ columns, BigQuery/Snowflake, codebase context | o1-preview agent | 21.3% (-69.9 pts) |
| Spider 1.0 | Sanitized, small schemas, obvious joins | GPT-4o | 86.6% (baseline) |
| Spider 2.0 | 1,000+ columns, BigQuery/Snowflake, codebase context | GPT-4o | 10.1% (-76.5 pts) |
| BEAVER (retrieval) | Private enterprise warehouse query logs | GPT-4o | 0.0% |
| BEAVER (no retrieval) | Private enterprise warehouse query logs | GPT-4o | 2.0% |
The full analysis of why this happens is in The Text-to-SQL Accuracy Cliff.
The Colrows first-party benchmark: what happens when the same queries are compiled?
External benchmarks expose the failure mode. To establish our own citable baseline, we ran raw generation against compiled semantic execution over the same test suite: 50 proprietary enterprise schemas plus the open-source MIMIC-III and eICU medical databases, which are notoriously hard because they require parsing hospital terminology, time-sensitive variables, and survival-rate calculations. Then we routed the identical queries through the Colrows deterministic compiler, where intent is resolved against a versioned semantic graph before any SQL runs.
| Test | Database complexity / origin | Execution method | Result |
|---|---|---|---|
| First-party accuracy | 50 proprietary schemas + EMR (MIMIC-III, eICU) | Raw LLM text-to-SQL | 14.5% execution accuracy |
| First-party accuracy | 50 proprietary schemas + EMR (MIMIC-III, eICU) | Colrows semantic compiler | 98.2% execution accuracy |
| Join-path accuracy | Complex multi-table aggregations | Raw LLM text-to-SQL | ~20% (80% hallucination rate) |
| Join-path accuracy | Complex multi-table aggregations | Colrows semantic compiler | 100% |
| Semantic resolution | Benchmark enterprise datasets | Colrows semantic compiler | 0.91 MRR@1, under 80 ms p99 latency |
Raw frontier models averaged 14.5%, failing to find correct join paths and hallucinating metric definitions. The same queries through Colrows reached 98.2%. The mechanism matters more than the headline: business semantics (does "revenue" include deferred subscriptions or exclude local taxes?) do not live in the physical schema, so a model that only reads the schema is guessing. Compiling against a typed graph removes the guess. It also removes the up-to-15% run-to-run variance that probabilistic models show even at temperature zero, which is what makes the output auditable. The methodology and downloadable framing are in the text-to-SQL benchmark, and the architecture is explained in What Is a Semantic Compiler?
Fix the context, not the model. Accuracy tracks the semantic structure surrounding the model, not the scale of the model. A governed semantic graph that resolves meaning at compile time produces more reliable enterprise AI than a larger probabilistic model ever will.
Case study: how did Cipla federate its data without a migration?
Cipla, a global pharmaceutical enterprise, runs a commercial apparatus of over 22,500 field representatives who need hyper-local, real-time intelligence on prescribing patterns, campaign efficacy, and inventory. The problem was fragmentation: CRM in Cirrius, financials in Oracle, logistics in a legacy ERP, with core entities (doctors, brands, campaigns, regions) existing as isolated tables with no shared identity. Every question became an IT ticket, campaign diagnosis took weeks, and a 30% localized stockout rate constrained revenue.
Rather than a multi-year centralization project, Colrows deployed its semantic execution layer over a Trino federated query engine, building a unified knowledge graph across the systems without moving data. Row-level scope was injected at compile time so each of the 22,500 reps saw only their territory.
| Operational dimension | Before (manual BI) | After (compiled semantics) |
|---|---|---|
| Data truth & reporting | Multiple systems, multiple versions of the truth | One semantic layer, one consistent metric definition |
| Decision latency | Campaign underperformance surfaced in days | Underperforming channels flagged 1000× faster |
| Supply chain | 30% stockout rate constraining supply | 30% reduction in stockouts; 15% better inventory turnover |
| Engineering burden | Every new question is an IT ticket | Business teams self-serve; IT freed for higher-value work |
Read the full Cipla case study.
Case study: how did SSP Group push decisions to the edge?
SSP Group operates 3,000+ food-and-beverage units across nearly 40 countries with roughly 49,000 staff. Engineering was fragmented across DBeaver, Squirrel, and Toad with no version control; the infrastructure team maintained cumbersome bastion hosts; a custom in-house reporting system had become unmaintainable; and 49,000 frontline staff had no self-serve access. Colrows replaced the fragmented client ecosystem with one governed workspace (SQL plus embedded Python notebooks), dismantled the bastion hosts, and put a conversational interface in Slack so store managers could ask questions in plain language, compiled deterministically against the semantic graph.
| Operational dimension | Before (fragmented legacy) | After (unified semantic platform) |
|---|---|---|
| Developer tooling & collaboration | A different tool per developer (DBeaver, Squirrel, Toad) | 80% improvement in cross-team collaboration |
| Front-line data access | Front-line staff had no self-serve access | 3× faster issue resolution via natural language |
| Infrastructure overhead | Bastion infrastructure piled on overhead | 40% reduction in data-management overhead |
| Technical debt | Custom in-house system could not scale | Custom internal system retired entirely |
Read the full SSP Group case study.
Why do the other semantic layers not clear this bar?
The market has several generations of semantic layer, each optimized for a different consumer. Presentation-time semantics (Looker/LookML) lock logic inside a BI tool. Metric stores (dbt Semantic Layer, Cube) decouple metrics from the dashboard but standardize arithmetic for human consumption and rely on manual authoring. OLAP-governance layers (AtScale) accelerate cube queries but require upfront modeling. All of them assume a human analyst on the other end. Colrows operates at a different altitude: it standardizes systemic meaning (entities, relational paths, security policies) into an autonomously built graph, and resolves it at compile time into deterministic, auditable SQL for machine agents. See the full platform comparison.
Frequently asked questions
How accurate is text-to-SQL on real enterprise data?
On sanitized academic benchmarks frontier models reach 86-91%, but on realistic enterprise data they collapse: Spider 2.0 dropped the o1-preview agent to 21.3% and GPT-4o to 10.1%, and on MIT's BEAVER (private query logs) GPT-4o scored 0.0% with retrieval and 2.0% without. In the Colrows first-party test, raw models averaged 14.5%.
What did the Colrows first-party benchmark measure and find?
Identical multi-step business and clinical queries were routed through raw LLM generation and through the Colrows compiler across 50 proprietary schemas plus MIMIC-III and eICU. Raw generation scored 14.5%; the compiler scored 98.2%, with 100% join-path accuracy (versus an 80% hallucination rate for raw models) and 0.91 MRR@1 semantic resolution at under 80 ms p99 latency.
How much does poor data quality and semantic decay cost enterprises?
An average of $12.9 million annually, per our survey of 50 data leaders and related enterprise metrics. The typical enterprise also absorbs about 400 data incidents and 2,400 hours of downtime a year (over $2.6 million), and BI teams can spend up to 80% of their time on maintenance.
Why does compiled semantic execution beat raw text-to-SQL?
Because accuracy tracks the semantic structure around the model, not its size. Colrows resolves intent against a versioned typed graph, proves join paths, injects governance before execution, and emits deterministic SQL, removing both the guesswork and the run-to-run variance that break audit.
Are these results reproducible?
Yes. The benchmark uses public EMR schemas (MIMIC-III and eICU) alongside proprietary ones, and compiled execution is deterministic: the same query compiles to the same SQL against a versioned definition, so results can be re-run with the definitions in force at the time.



