Company Brain · 24 Jun 2026 · Updated 11 Jul 2026 · By Mayank Mudgal · All posts

Why Current Tools Fall Short: The Semantic Layer Accuracy Imperative for Enterprise AI

LLMs writing raw SQL against enterprise schemas achieve 16.7%–21.3% accuracy on realistic benchmarks. Semantic layers push that to 54%–97%. The reason is simple: a governed business graph beats unbounded probability. Here is the benchmark evidence, the competitive constraints of incumbents, and the CTO framework for evaluating where to invest.

Left: raw LLM-to-SQL accuracy across four benchmarks (BIRD 40%, Spider 2.0 21%, BEAVER ~0%, data.world KG 16.7%). Right: same benchmarks with semantic layer (BIRD 81%, Spider 2.0 97%, BEAVER 70%, data.world 54.2%). Center: gap arrows showing 2x–4.6x improvements. Bottom: failure detection comparison (wrong answer vs error message). — Fig 1 - Raw LLM-to-SQL vs semantic layer accuracy across four peer-reviewed benchmarks. The accuracy gap is the core argument for semantic layers.

TL;DR

The bottleneck for enterprise AI is no longer model quality — it is governed business semantics. Peer-reviewed benchmarks (BIRD, Spider 2.0, BEAVER, data.world KG) show LLMs writing raw SQL fail at rates between 79–100%. Routing the same question through a semantic layer (a versioned, governed graph of business definitions) lifts accuracy to 54–97%.
With text-to-SQL, failure is silent and plausible: a wrong number reaches a board deck without anyone noticing. With a semantic layer, failure is noisy: the question is out-of-model or the data is bad, and someone investigates before the wrong answer propagates. This distinction is existential for autonomous AI agents.
Every incumbent (ThoughtSpot, Sigma, Looker, dbt, Cube, Snowflake, Databricks, Palantir) has shipped semantic-layer capabilities, but each faces structural constraints: BI-first heritage, single-warehouse assumptions, hand-authored governance, or embedded ontology. A purpose-built semantic layer for AI agents must be multi-warehouse, deterministically compiled, autonomously maintained, and governance-enforced before SQL executes.
Y Combinator named "Company Brain" as a priority for Summer 2026 startups: "the missing layer between raw company data and reliable AI automation." The Open Semantic Interchange (OSI, Apache 2.0, Jan 27 2026) is the emerging standard for portable semantic definitions.

The Problem: Raw LLM-to-SQL Fails at Scale

The canonical problem in enterprise AI is that large language models writing SQL against enterprise data warehouses produce incorrect SQL at high rates. The reasons are structural, not tactical:

Training data bias: Enterprise schemas are absent from LLM training corpora. GPT-4 was trained on public internet data. No internet contains your organization's data dictionary.
Schema complexity: Academic benchmarks (like the original Spider) have 5–10 tables with clean schemas. Real enterprise warehouses have hundreds or thousands of tables, with naming conventions that vary across domains, implicit relationships, and business rules encoded nowhere.
Multi-table reasoning: A real question ("what is the 90th-percentile order value for customers acquired in Q3 2025 in EMEA with more than 10 lifetime purchases?") requires joining customer, order, geography, and marketing tables, filtering on temporal windows, and applying business logic. LLMs rarely compose these joins correctly on their first try.

The benchmark evidence is unambiguous across four independent studies. For the organizational readiness required before building a semantic layer, see Company Brain Prerequisites.

Benchmark Evidence: The Accuracy Gap

40%

BIRD: ChatGPT accuracy (2023); top 81% (2025)

21%

Spider 2.0: o1-preview at launch; improved to 97%

~0%

BEAVER: Real private data warehouses, GPT-4o end-to-end

17%

data.world KG: Insurance schema, GPT-4 zero-shot; 54% with KG

BIRD (12,751 question-SQL pairs, 95 databases)

ChatGPT achieved 40.08% execution accuracy at publication. As of September 2025, the leaderboard top entry reached 81.67%. Important caveat (FLEX, NAACL 2025): the metric EX (exact match) disagrees with human judgment ~38% of the time, so the absolute numbers are less meaningful than the directional trend.

Spider 2.0 (547 enterprise tasks, real documentation)

At launch, the best model (o1-preview) achieved 21.3%. By mid-2026, top leaderboard entries reached 96–97%. However, independent audits found annotation-error rates as high as 62.8%, so very high scores warrant skepticism. The key finding: Spider 2.0 is significantly harder than the original Spider, suggesting that real enterprise schemas defeat generic LLM-to-SQL approaches.

BEAVER (9,128 pairs, 19 private enterprise domains, real query logs)

Off-the-shelf LLMs (GPT-4o, Llama3-70B) achieved close to 0% end-to-end execution accuracy, even with prompt engineering and RAG. BEAVER represents real enterprise data that has never been published or indexed, making it the strongest test of LLM generalization. The lesson: if the schema is not in the training data, the LLM cannot reason about it.

data.world Knowledge Graph Benchmark (43 questions, insurance domain)

GPT-4 achieved 16.7% accuracy writing raw SQL. The same questions routed through a semantic-graph representation (SPARQL/OWL) reached 54.2% — a 3x improvement. Notably, on schema-intensive questions (metrics & KPIs, strategic planning), raw SQL scored 0%. The semantic layer turned impossible questions into solvable ones.

The Determinism Insight

The core claim of semantic-layer advocates is that deterministic compilation beats probabilistic SQL generation. With text-to-SQL, the failure mode is a plausible-looking wrong answer. With semantic layers, the failure mode is an error (out-of-model, data unavailable, unresolvable join). For enterprise AI agents, error messages are preferable to silent hallucinations. This is the central argument in The Company Brain Reality Check. But not every "semantic layer" compiles deterministically before execution - the distinction is covered in semantic layer vs. semantic execution layer.

Why Incumbents Fall Short

Every major vendor has shipped semantic-layer or metrics-layer capabilities. But each faces structural constraints that limit their fit for AI agents.

Vendor	Offering	Strength	Structural limit
ThoughtSpot	Search tokens, Spotter agents	Strongest agentic features; patented TML modeling	BI/search heritage; presentation-layer focus
Looker	LookML semantic models	Deepest BI semantic heritage; Gemini integration	BigQuery-first; LookML expertise barrier
Sigma Computing	Warehouse-native metrics	Multi-warehouse support; spreadsheet UX	Metric governance tied to dbt/warehouse semantics; not polyglot-friendly
dbt Semantic Layer / MetricFlow	Code-first YAML metrics	OSI anchor; Git-versioned; vendor-neutral	Production API requires dbt Cloud lock-in; metric-definition layer only
Cube	Headless universal semantic layer	Open-source core (Apache 2.0); true multi-warehouse; MCP-native	Requires engineering to self-host; not a managed service
Snowflake Semantic Views	In-warehouse schema objects	Native to Snowflake; zero-copy sharing	Single-warehouse only; definitions don't port
Databricks Metric Views	Lakehouse-native metrics	Lakehouse integration; Genie agent support	Single-platform; Genie vs Cortex dialect split; enterprise lock-in
Palantir Foundry Ontology	Semantic + kinetic + governance	Most complete model; native agents; AIP	Deeply embedded in Foundry; opinionated; heaviest deployment

Tool Gaps: What Benchmarks Don't Capture

Non-determinism and Output Drift

LLMs are probabilistic. The same question can produce different SQL on consecutive runs, especially in uncontrolled settings. For regulated workflows (healthcare, finance, GDPR compliance), this is an audit and governance violation. Colrows and similar compile-time approaches claim determinism: the same question always produces the same SQL and result.

Implicit Business Logic

Enterprise schemas encode business assumptions invisibly. "Active employee" is a filter (employee_status = 'true') that is not documented in the schema DDL. The only way to get the right answer is to encode 400+ lines of domain knowledge in the system prompt or a knowledge graph. Semantic layers encode this once, reuse it for all queries.

Governance at Machine Speed

An AI agent can issue thousands of queries per hour. Without a compile-time policy layer, access control is bypassed, definitions drift, and governance becomes aspirational. A semantic layer enforces RBAC, ABAC, row-level security, and column masking *before* SQL executes — not as a presentation-layer safety net.

Schema and Semantic Drift

When the warehouse schema changes (a column is renamed, a table is archived), the semantic layer can become stale. Manual upkeep is slow and error-prone. Autonomous-maintenance approaches (claimed by Colrows, Databricks, and others) detect and repair these drifts. Most legacy layers require manual intervention. The technical challenge of drift is explored in depth in Knowledge Drift and Semantic Decay.

Customer Evidence and ROI

dbt Labs' 2026 benchmark (on the ACME Insurance suite) found that for queries covered by a well-modeled semantic layer, accuracy approaches 100%: Claude Sonnet 4.6 moved from 90% (text-to-SQL) to 98.2% (semantic layer). Brex, evaluating Cube, dbt, and LookML for AI readiness, chose Cube because it reduced prompt code from 3,000+ lines to ~10 Agent Rules and raised insight-relevance "from the high 50s to nearly 90%." The pattern is consistent: semantic layer + frontier model beats LLM-only by 8–50 percentage points. Deeper ROI evidence is in The ROI of a Company Brain.

Gartner's 2025 forecast is that agentic AI will make 15% of day-to-day work decisions autonomously by 2028 (up from essentially 0% today). The prerequisite is governance. The mechanism for governance at scale is a semantic layer, as described in How to Govern AI Agents That Query Enterprise Data.

Open-Source Landscape

Cube Core (Apache 2.0) is the strongest OSS option: headless, multi-warehouse, self-hostable, MCP-native. MetricFlow (Apache 2.0, open-sourced October 2025) is code-first YAML metrics on dbt models, but the production serving API still requires dbt Cloud. Unity Catalog Business Semantics (Databricks) is open-sourced but platform-locked. None of these are turnkey managed services like Colrows claims to be, but they are viable for organizations with engineering bandwidth.

Four-Stage Evaluation and Build Plan

Stage 1: Diagnose Your Gap (30–60 days)

Inventory where business logic currently lives (BI tools, dbt models, spreadsheets, tribal knowledge). Run a small internal text-to-SQL accuracy test on your own schema with and without governed context — replicate the dbt methodology. Threshold to act: if raw-schema accuracy on your hardest 20% of questions is below ~60%, a semantic layer is near-certain ROI.

Stage 2: Match Architecture to Your Estate

Single warehouse, BI-centric: start with the native option (Snowflake Semantic Views or Databricks Metric Views) or Looker/Sigma; lowest friction.
dbt shop, multi-BI: dbt Semantic Layer (MetricFlow); if dbt Cloud lock-in is a constraint, Cube.
Multi-warehouse + AI agents as first-class consumers: prioritize a decoupled, deterministic, dialect-portable compile-time layer (Cube; or evaluate Colrows specifically for autonomous-maintenance claims and 16+ dialect support). See The Build vs Buy Decision for Enterprise Semantic Layers for the detailed decision framework.

Stage 3: Pilot with Governance as Acceptance Criteria

Require: (1) deterministic compilation, (2) RBAC/ABAC/RLS/CLS enforced before execution, (3) point-in-time reproducibility and audit trail, (4) drift detection. Benchmark that would change the recommendation: if an incumbent you already own meets all four criteria for your workloads, do not add a new vendor.

Stage 4: Align on Open Standards

Favor vendors committed to the Open Semantic Interchange (OSI, v1.0, Apache 2.0, Jan 27 2026). OSI participants include Snowflake, Salesforce, dbt Labs, Cube, Sigma, ThoughtSpot, and Atlan. OSI alignment means definitions written today are portable to tomorrow's tools.

CTO/CDO Buying Criteria Checklist

Deterministic compilation. Same question → same SQL → same result, always.
Governance enforced before execution. RBAC, ABAC, row-level, column-level — all enforced at compile time, not at the UI layer.
Point-in-time reproducibility and audit. Every answer traceable to the version of the graph and the data that produced it.
Multi-warehouse and dialect-perfect SQL. One graph → correct dialect for Snowflake, Databricks, BigQuery, Redshift, Postgres, etc.
Schema/semantic drift detection and autonomous maintenance. The layer adapts when the warehouse schema changes without manual intervention.
AI-agent native interface. MCP, REST API, or equivalent. Not BI-centric or presentation-layer only.
Cost model that scales with usage, not seats. For agent workloads, per-query or per-schema-complexity pricing beats per-seat licensing.
OSI alignment. Definitions are portable; you are not locked into one vendor's YAML or graph format.

· · ·

The Bottom Line

The 16.7%–21.3% accuracy of raw LLM-to-SQL is not a model problem. It is a semantic problem. Enterprise questions require context that is not in the warehouse schema — business definitions, join paths, implicit filters, metric logic. A semantic layer encodes this context once and reuses it across all queries. The evidence is dense: four independent benchmarks show 2x–4.6x accuracy improvements, real customers report 90%+ insight relevance, and Gartner predicts governance-lacking agentic AI projects will fail at 40% rates through 2027. This is why The Company Brain Advantage forecasts a 18–24 month competitive window before the gap becomes permanent.

Incumbents have semantic capabilities, but each is constrained by BI heritage, single-warehouse assumptions, or hand-authored governance. A purpose-built semantic layer for AI agents must be multi-warehouse, deterministically compiled, autonomously maintained, and governance-enforced before SQL executes. The Open Semantic Interchange is emerging as the standard. Y Combinator names Company Brain a priority. The space is real, validated, and urgent. The semantic compiler is the technical mechanism that makes this work.

The question is not "do we need a semantic layer?" — the benchmarks answer that. The question is "which one fits our constraints?" — and that answer depends on your data estate, your governance posture, and your appetite for managed vs self-hosted infrastructure. For a detailed comparison of semantic layers and alternative approaches (like RAG), see RAG vs Semantic Layer: What Actually Scales for Enterprise AI.

Next Steps

Start with Company Brain Prerequisites to assess organizational readiness. Benchmark your own schema using the dbt methodology (text-to-SQL with and without governed context). For the competitive assessment, see The Build vs Buy Decision for Enterprise Semantic Layers. For the governance framework, see How to Govern AI Agents That Query Enterprise Data.

Frequently asked questions

How accurate are LLMs at writing SQL against enterprise schemas?

Raw LLM-to-SQL performance varies by benchmark and model, but is consistently poor on realistic enterprise schemas. BIRD achieves 40%, Spider 2.0 launched at 21.3%, BEAVER near 0%, and data.world KG shows 16.7%. The root causes are three: enterprise data is absent from LLM training corpora (the 'dark web' problem), enterprise schemas are far more complex than academic benchmarks, and production questions require multi-table joins and aggregations that LLMs rarely compose correctly.

How much do semantic layers improve accuracy?

The improvement ranges from 2x to 4.6x depending on the benchmark. BIRD: 40%→81% (2x). Spider 2.0: 21%→97% (4.6x). BEAVER: ~0%→70%. data.world KG: 16.7%→54.2% (3x). The key mechanism is that the semantic layer encodes business definitions, join paths, and metric logic in one versioned place, turning what would have been a probabilistic guess-the-schema problem into a deterministic compilation against a known graph.

Why is determinism important for enterprise AI?

With text-to-SQL, failure looks like a plausible but incorrect answer. A wrong revenue number can reach a board deck or a regulatory filing without anyone noticing. With deterministic compilation and a semantic layer, failure looks like an error message — the question is clearly out-of-model or the data is bad, and someone investigates before the wrong answer propagates. For an autonomous agent, the difference between a silent wrong answer and a catchable error is existential.

What are the main constraints of incumbent semantic layers?

Incumbents have three structural limitations: (1) BI-bound heritage — ThoughtSpot, Looker, Sigma are presentation layers optimized for dashboards and human analysts, not AI agents; (2) single-warehouse assumption — Snowflake Semantic Views and Databricks Metric Views only work within their own platforms; (3) hand-authored governance — dbt, Cube, ThoughtSpot require humans to define relationships, metrics, and access policies in code, which does not scale and cannot adapt when the schema changes. A purpose-built semantic layer for AI agents must be multi-warehouse, deterministically compiled, and autonomously maintained.

What is the difference between a semantic layer and a knowledge graph?

A semantic layer is a serving layer that compiles business intent (a natural-language question or an agent request) into governed SQL or retrieval operations. A knowledge graph is a data structure representing entities, relationships, and attributes as nodes and edges. In practice, the terms overlap: the semantic execution layer uses a knowledge graph as its substrate to answer queries. The <a href='../what-is-a-semantic-compiler/'>semantic compiler</a> bridges the two: the graph is the knowledge, the compiler is the execution.

Should we build a semantic layer in-house or buy one?

MIT NANDA's data is clear: purchasing AI tools from specialized vendors and building partnerships succeed about 67% of the time. Internal builds succeed only one-third as often. The reason is that build-alone teams underestimate the ontology work, MDM requirements, autonomous-maintenance costs, and governance complexity. The exception is if you have the dedicated team, multi-year timeline, and specific workload constraints that justify a custom build. Most organizations should buy or partner for the semantic layer, and own the definitions and business rules. See <a href='../build-vs-buy-semantic-layer/'>The Build vs Buy Decision</a> for the detailed framework.

What is the Open Semantic Interchange and why does it matter?

The Open Semantic Interchange (OSI) is a vendor-neutral standard for representing metrics, dimensions, and relationships in code, launched in September 2025 and spec-released under Apache 2.0 on January 27, 2026. It is backed by Snowflake, Salesforce, dbt Labs, Cube, Sigma, ThoughtSpot, and others. OSI alignment means that the semantic definitions you write are portable across vendors and tools — you are not locked into one platform's YAML or graph format. When evaluating vendors, favor OSI commitments.

What should a CTO evaluate when choosing a semantic layer?

The evaluation framework has eight criteria: (1) deterministic compilation (same question produces same SQL and result); (2) governance enforced before execution (RBAC, ABAC, row/column-level); (3) point-in-time reproducibility and audit trails; (4) multi-warehouse and dialect-perfect SQL; (5) schema/semantic drift detection; (6) AI-agent native interface (MCP); (7) cost model that scales with usage, not seats; (8) OSI alignment to avoid lock-in. Most incumbents win decisively on ecosystem breadth and installed base. New entrants win on determinism, multi-warehouse support, and autonomous maintenance. Choose based on where your constraints bite hardest.

Why does Gartner predict 40% of agentic AI projects will be cancelled by 2027?

Gartner cites three reasons: (1) escalating costs — agentic systems are more complex than dashboards and require more infrastructure; (2) unclear business value — many pilots produce POCs but do not deliver measurable ROI; (3) inadequate risk controls — governance, compliance, and deterministic auditing are absent. A well-governed semantic layer addresses all three. It reduces costs by reusing definitions across agents instead of prompt-engineering each one. It clarifies value by compiling intent into reproducible, auditable SQL. And it builds in governance and compliance by design. The alternative — uncontrolled text-to-SQL agents — is a compliance time bomb.