Three intent sources - AI agent, semantic layer, and BI tool - flow downward into a central glowing band labelled 'SQL: Intermediate Representation', which then flows into three execution engines (warehouse, lakehouse, transactional database) at the bottom. The figure illustrates SQL as a compile target between intent and execution.

Semantic Layer & AI Agents·30 Apr 2026·Updated 11 Jul 2026·By Yogendra Sharma

SQL as a Compiler Target: The Future of Governed Enterprise AI

Probabilistic SQL generation is a bottleneck to enterprise adoption. When your AI agents treat SQL as a "best guess" rather than a deterministic target, you are not building production infrastructure. You are building a liability. SQL should be the result of a compilation process, not the output of a prompt.


LLM prompting vs. compiler targeting

Capability	LLM Prompting (The Guess)	Colrows Compiler Target (The Goal)
Logic source	Probabilistic prompt	Formal semantic definition
SQL generation	Black-box / brittle	Governed / deterministic
Schema drift	High risk (hallucination)	Self-healing / resilient
Auditability	Manual / difficult	Automated / full lineage
Enterprise fit	Experimental / prototyping	Production / mission-critical

The architecture gap

The prompt-to-SQL fallacy

Standard LLM workflows treat SQL as the final output. The model sees a question, looks for patterns in training data, and predicts the most likely SQL tokens. Nothing in that process supplies the semantic guardrails needed to understand enterprise context. A business rule (active revenue excludes refunds and credits) becomes a guess about column names and filter logic. A metric definition (EMEA means countries A, B, C) becomes a chance encounter with that pattern in training data. When the model has not seen the exact schema or business rule, it hallucinates.

The compiler target advantage

Colrows reframes the problem. SQL is not the endpoint. It is the intermediate representation (IR) of a compilation process. Business metrics and relationships are first-class objects in a semantic graph. When a question arrives, the compiler resolves it against that graph, validates every reference, proves join paths exist, and emits SQL as a governed, deterministic output. By making SQL the target of your semantic compiler, you guarantee accuracy regardless of how the underlying model changes. The model assists in understanding the question. The compiler ensures the answer is right. See the Colrows semantic compiler architecture for the full workflow.

Do not prompt your way to success. Compile your way to reliability. Fix the context, not the model.

What a compiler actually does

A compiler translates source in one language into a target language through a fixed sequence of phases. The standard reference, Aho, Lam, Sethi, and Ullman's Compilers: Principles, Techniques, and Tools (the "Dragon Book"), lays them out:

Lexical analysis: turn characters into tokens.
Syntax analysis (parsing): turn tokens into a tree, according to a grammar.
Semantic analysis: type-check and resolve symbols. This is where a compiler catches "you cannot assign a float to an int" before anything runs.
Intermediate representation: a machine-independent form between the front end and the back end.
Optimization: rewrite the IR to run faster while preserving meaning.
Code generation: emit the target.

Three properties make compilers trustworthy. Determinism: the same source and inputs produce the same output, every time. Compiler teams treat non-determinism as a bug to be fixed. Verifiability: type checking and semantic analysis reject whole classes of errors loudly, at compile time, instead of letting them surface as wrong results later. Optimization: the compiler can apply transformations a human would not bother with, safely, because it understands the semantics.

There is a cost: you pay for compilation up front, before execution. But you pay once and amortize it across every run.

Why text-to-SQL is not a compiler

Text-to-SQL generates SQL by predicting likely tokens. It is sampling, not compilation. That has three consequences that no amount of scale removes.

It is not deterministic, even at temperature 0. In September 2025, Thinking Machines Lab sampled one prompt 1,000 times at temperature 0 from a 235B-parameter model and got 80 different completions; the outputs were identical for the first 102 tokens and first diverged at the 103rd. The cause was not random sampling; it was batch non-invariance in the reduction kernels.

It cannot type-check. The model does not resolve your question against a verified schema and metric graph; it pattern-matches against what SQL tends to look like. So it guesses joins, invents columns, and assumes definitions.

It fails silently. A syntax error crashes loudly and you fix it. A semantically wrong query runs successfully and returns a plausible number. You ship the dashboard. A week later someone notices revenue is off because the model joined on the wrong key.

The benchmarks make the gap concrete. On Spider 1.0, with clean academic schemas, DIN-SQL with GPT-4 reaches 85.3% execution accuracy and GPT-4-based methods reach as high as 91.2%. Move to Spider 2.0, built from real enterprise databases with thousands of columns and multiple dialects, and GPT-4o solves only 10.1% and o1-preview 17.1% (Lei et al., 2024). GPT-4o, which scores 86.6% on Spider 1.0, drops 76.5 points to roughly 10% on enterprise workloads. On the BIRD benchmark of real, dirty databases, GPT-4 reaches just 54.89% even with curated hints, against a human score of 92.96%. The accuracy cliff is real and well-documented.

SQL as the intermediate representation

Here is the reframe. SQL is not the human interface. It is the IR, sitting between business intent and physical execution exactly the way bytecode sits between Java source and the JVM.

intent → semantic layer (front-end compiler) → SQL (IR) → database planner (back-end compiler) → execution

There are two compilers in this picture. The semantic layer compiles intent into SQL. The database planner compiles SQL into a physical plan. The second compiler is one of the most mature pieces of software humanity has built: cost-based optimization goes back to System R in 1979, became extensible with Graefe and McKenna's Volcano (1993) and Graefe's Cascades (1995), and now ships inside SQL Server, Greenplum's Orca, and the open-source Apache Calcite, which carries more than 100 optimization rules and can render one plan into many SQL dialects.

That maturity is exactly why SQL is the right target:

Declarative: you say what you want; the planner decides how. The front-end compiler does not have to pick physical operators.
Universal: Snowflake, BigQuery, Databricks, Redshift, PostgreSQL, DuckDB, and Trino all speak it.
Optimizable: four decades of planner research run for free under your emitted SQL.
Auditable: the SQL text is a readable artifact you can log, review, and replay.
Standardized: ISO/IEC 9075, versioned steadily from SQL-86 through SQL:2023, which was adopted in June 2023.

A semantic layer that targets a proprietary API would be locked in. One that targets SQL inherits the entire ecosystem and runs anywhere.
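The back-end compiler is easy to watch at work. The sketch below uses SQLite only because it ships with Python; the `orders` table and its index are hypothetical, and the exact plan text varies by SQLite version. It hands the engine a declarative query and asks which physical strategy the planner chose.

```python
import sqlite3

# In-memory SQLite stands in for the target engine (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount NUMERIC);
    CREATE INDEX idx_orders_region ON orders(region);
""")

# The emitted SQL says what is wanted, not how to fetch it.
sql = "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'"

# EXPLAIN QUERY PLAN returns the planner's chosen strategy, one row per step.
plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
for row in plan:
    print(row[-1])  # last column holds the human-readable plan step
```

The front end never names an index or a scan order. That decision belongs to the planner, which is the division of labour the two-compiler picture relies on.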

How a semantic layer compiles

A semantic SQL compiler runs the same phases as the Dragon Book, mapped onto data. The decisive difference is what happens on failure. A semantic compiler fails loudly: ask for a metric that is not defined, or a join path the graph cannot prove, and you get an error, not a guess. Text-to-SQL fails silently: it returns confident, runnable, wrong SQL.

dbt makes the same point about MetricFlow: because the semantic engine generates the query deterministically, the model "can't produce an incorrect join or a bad aggregation." The payoff shows up in the numbers. dbt Labs' April 2026 benchmark compared the two approaches on a fixed question set: text-to-SQL scored 84-90%, while semantic-layer compilation scored 98.2% with Claude Sonnet 4.6 and 100.0% with GPT-5.3-codex on covered queries, because "the Semantic Layer's deterministic query generation means the LLM can't produce subtly wrong results."
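The fail-loudly behaviour described in this section can be sketched in a few lines. This is a minimal illustration, not Colrows' or MetricFlow's implementation: the `JOINS` graph, the table names, and `CompileError` are all hypothetical. A join is emitted only if a path of declared relationships connects the two tables; otherwise compilation stops.

```python
from collections import deque


class CompileError(Exception):
    """Raised when a request cannot be resolved against the graph."""


# Hypothetical relationship graph: each edge is a declared, valid join.
JOINS = {
    ("orders", "customers"): "orders.customer_id = customers.id",
    ("customers", "countries"): "customers.country_code = countries.code",
}


def neighbours(table):
    """Yield (adjacent table, join condition) for every declared edge."""
    for (a, b), cond in JOINS.items():
        if a == table:
            yield b, cond
        elif b == table:
            yield a, cond


def prove_join_path(src, dst):
    """Breadth-first search; returns join conditions along the shortest path."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        table, path = queue.popleft()
        if table == dst:
            return path
        for nxt, cond in neighbours(table):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [cond]))
    # No path exists: refuse to guess a join.
    raise CompileError(f"no declared join path from {src!r} to {dst!r}")
```

Calling `prove_join_path("orders", "countries")` returns the two declared conditions in order. Asking for a table the graph does not know, such as a hypothetical `tickets`, raises `CompileError` instead of producing a plausible-looking join.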

A worked contrast

Ask: "What was active revenue in EMEA last quarter?"

Text-to-SQL might emit, plausibly:

SELECT SUM(amount) FROM orders
WHERE region = 'EMEA' AND status = 'active'
  AND order_date >= '2026-01-01';

Looks right. But which `amount`, gross or net? Does `active` mean `status = 'active'` or "not churned"? Is EMEA a region code or a roll-up of countries? Is "last quarter" the calendar or fiscal quarter? The model picked one reading of each, silently. Another run, or another server batch, may pick differently.

A semantic compiler resolves every one of those against the typed graph first. If `active_revenue` is defined, it compiles to the one correct SQL and logs the decision trail. If it is not defined, it stops with a compile error, not a silent wrong answer.
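As a rough sketch of that resolution step, assume a tiny registry in which `active_revenue` and `EMEA` are defined once. Every name here (`METRICS`, `REGIONS`, the column names, the three country codes) is hypothetical and chosen only to mirror the question above. Choosing between the calendar and fiscal quarter is left to the caller, who passes explicit date bounds.

```python
from dataclasses import dataclass


class CompileError(Exception):
    """Raised when a metric or dimension is not defined."""


@dataclass(frozen=True)
class Metric:
    expression: str   # aggregate to emit
    table: str        # source table
    filters: tuple    # business rules baked into the definition


# Hypothetical definitions: active revenue excludes refunds and credits.
METRICS = {
    "active_revenue": Metric(
        expression="SUM(net_amount)",
        table="orders",
        filters=("is_refund = FALSE", "is_credit = FALSE"),
    ),
}

# Hypothetical roll-up: EMEA is a fixed list of countries, not a region code.
REGIONS = {"EMEA": ("DE", "FR", "GB")}


def compile_metric(metric, region, start, end):
    """Emit SQL for a defined metric, or stop with a compile error."""
    if metric not in METRICS:
        raise CompileError(f"undefined metric: {metric!r}")
    if region not in REGIONS:
        raise CompileError(f"undefined region: {region!r}")
    m = METRICS[metric]
    countries = ", ".join(f"'{c}'" for c in REGIONS[region])
    predicates = list(m.filters) + [
        f"country_code IN ({countries})",
        f"order_date >= DATE '{start}'",
        f"order_date < DATE '{end}'",
    ]
    return f"SELECT {m.expression} FROM {m.table} WHERE " + " AND ".join(predicates)


print(compile_metric("active_revenue", "EMEA", "2026-01-01", "2026-04-01"))
```

The same inputs always produce the same string, and a request for an undefined metric such as `churned_revenue` raises `CompileError` rather than returning a guess.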

The honest part: determinism has limits

A rigorous reader will push back, correctly, that SQL is not perfectly deterministic. The caveat, stated plainly: floating-point addition is not associative. When a parallel engine sums a `FLOAT` column, the order of additions depends on how work was split across threads, so totals can differ in the low-order digits. Use `DECIMAL`/`NUMERIC` for money, and the problem largely disappears.

But notice the scale of the two problems. SQL's non-determinism is bounded, characterizable, and fixable (pick the right type, control parallelism). LLM non-determinism is unbounded across the space of plausible-but-wrong queries. Deterministic-at-a-fixed-version is a guarantee you can build controls on. "Usually returns something reasonable" is not.
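The floating-point caveat is easy to reproduce outside a database. The sketch below uses plain Python floats and the standard `decimal` module as stand-ins for `FLOAT` and `DECIMAL` columns; the three values are arbitrary.

```python
from decimal import Decimal

values = [0.1, 0.2, 0.3]

# Two groupings of the same additions, as two thread schedules might produce.
left_first = (values[0] + values[1]) + values[2]
right_first = values[0] + (values[1] + values[2])
print(left_first == right_first)  # False: binary floats are not associative

exact = [Decimal("0.1"), Decimal("0.2"), Decimal("0.3")]
exact_left = (exact[0] + exact[1]) + exact[2]
exact_right = exact[0] + (exact[1] + exact[2])
print(exact_left == exact_right)  # True: decimal arithmetic is exact here
```

The float totals differ only in the last digit, which is the bounded, characterizable kind of non-determinism the section describes.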

Why this is the right model for regulated work

For a strategic buyer, the technical story converts directly into risk reduction. Every major compliance regime wants the same thing: a reconstructable chain from input to output. A semantic compiler produces that chain as a byproduct; a text-to-SQL system produces logs no regulator asked for.

SOX Section 404 requires auditable internal controls over financial reporting, a documented chain of custody from source to statement. A compiled SQL artifact, stamped with graph version, identity, and resolved definitions, is that chain.
HIPAA 45 CFR 164.312(b) requires mechanisms that "record and examine activity in information systems that contain or use" ePHI. Compilation logs deliver field-level, query-level traceability.
GDPR Article 22 restricts decisions based solely on automated processing, and Articles 13 to 15 give people a right to "meaningful information about the logic involved" in such decisions. A versioned semantic graph plus the compiled query is that logic, in inspectable form.
EU AI Act Article 12 requires high-risk AI systems to "technically allow for the automatic recording of events (logs) over the lifetime of the system," retained at least six months (Article 26(6)), with high-risk obligations applying from 2 August 2026.

Text-to-SQL logs tokens, prompts, temperatures, and retries. A semantic compiler logs the intent, the compilation decisions, the SQL, the data version, and the result. Only one of those is an audit trail.
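One way to picture such an audit record is sketched below. The field names and the fingerprinting scheme are assumptions made for illustration, not a documented Colrows log format. The point is that every field is a resolved decision, so two identical compilations produce byte-identical records.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CompilationRecord:
    """Hypothetical audit record for one compiled query."""
    intent: str
    graph_version: str
    identity: str
    resolved_definitions: dict
    sql: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal records hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


record = CompilationRecord(
    intent="active revenue in EMEA last quarter",
    graph_version="v42",                       # hypothetical version label
    identity="analyst@example.com",            # hypothetical caller
    resolved_definitions={"active_revenue": "SUM(net_amount)"},
    sql="SELECT SUM(net_amount) FROM orders",  # shortened for the example
)
print(record.fingerprint())
```

Replaying the same intent against the same graph version should reproduce the same fingerprint. A changed graph version changes it, which gives a reviewer a concrete thing to compare.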

What it costs, and what it saves

Compilation overhead is negligible: plan generation runs in milliseconds, and dbt reports Semantic Layer latency "in the order of milliseconds." The database does the real work, and the semantic layer can even improve it by pushing down predicates and pruning columns a generic planner would miss.

The cost contrast with agents is stark. Published cost tables for multi-step text-to-SQL put DIN-SQL at $0.7867 per query and DAIL-SQL at $0.1232 per query, each taking 12-16 seconds of API time. Deterministic compilation needs no LLM inference to build the SQL, so its marginal cost per query is effectively zero, and its compiled plans cache cleanly because they do not change run to run.
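To put those per-query figures in operational terms, the arithmetic below scales them to a workload. The per-query costs come from the paragraph above; the volume of 1,000 queries a day over 30 days is a hypothetical chosen for illustration.

```python
# Per-query API costs quoted above (USD).
DIN_SQL_COST = 0.7867
DAIL_SQL_COST = 0.1232


def monthly_cost(per_query, queries_per_day, days=30):
    """Total spend for a steady workload, rounded to cents."""
    return round(per_query * queries_per_day * days, 2)


# Hypothetical workload: 1,000 queries a day.
din_monthly = monthly_cost(DIN_SQL_COST, 1000)
dail_monthly = monthly_cost(DAIL_SQL_COST, 1000)
print(din_monthly, dail_monthly)
```

At that volume the two agents come to about $23,601 and $3,696 a month in inference alone, against a compilation step that incurs no LLM inference cost.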

Where Colrows fits

Colrows is built as a deterministic semantic SQL compiler. It takes intent, whether a natural-language question or a structured agent request, compiles it against a typed semantic graph, and emits dialect-correct SQL for the target engine. It logs the whole chain: intent, resolved entities, proven join paths, generated SQL, and execution record. Same query, same graph version, same answer, with the proof attached.

Two things distinguish it. First, multi-dialect, cross-estate scope: unlike warehouse-native semantic layers that stop at the warehouse boundary, Colrows compiles across multiple sources without modification. Second, compile-time governance: row and column policies are injected during compilation, not bolted on at runtime.
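Compile-time governance can be sketched as follows, under stated assumptions: the `Query` shape, the role name, and both policy tables are hypothetical and do not reflect Colrows' actual policy model. Policies are applied while the SQL is being built, so no ungoverned query text ever exists.

```python
from dataclasses import dataclass, field


class PolicyError(Exception):
    """Raised at compile time when a request violates a policy."""


@dataclass
class Query:
    columns: list
    table: str
    predicates: list = field(default_factory=list)


# Hypothetical policies keyed by role.
ROW_POLICIES = {"emea_analyst": ["region = 'EMEA'"]}
DENIED_COLUMNS = {"emea_analyst": {"customer_email"}}


def compile_with_policies(query, role):
    """Emit SQL with row filters injected and denied columns rejected."""
    if role not in ROW_POLICIES:
        # Fail closed: an unknown role gets no SQL at all.
        raise PolicyError(f"no policy defined for role {role!r}")
    denied = DENIED_COLUMNS.get(role, set()) & set(query.columns)
    if denied:
        raise PolicyError(f"role {role!r} may not read: {sorted(denied)}")
    predicates = query.predicates + ROW_POLICIES[role]
    where = f" WHERE {' AND '.join(predicates)}" if predicates else ""
    return f"SELECT {', '.join(query.columns)} FROM {query.table}{where}"


request = Query(["order_id", "amount"], "orders", ["status = 'paid'"])
print(compile_with_policies(request, "emea_analyst"))
```

Because the row filter is part of the emitted SQL, it shows up in the audit trail alongside everything else the compiler decided.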

And the honest limit: Colrows is a data-semantics compiler, not a general-purpose one. It is not trying to replace Python, Java, or hand-written SQL for application logic. It compiles business intent into governed, auditable, deterministic data queries. That is the job, and it is exactly the job that probabilistic generation cannot do safely.

The bottom line

SQL will not die. It will move down the stack and become what it was always best suited to be: the intermediate representation between meaning and execution. The semantic layer becomes the front-end compiler; the database planner stays the back-end compiler; and the artifact in the middle is deterministic, optimizable, and auditable. For builders, that is a cleaner architecture grounded in decades of proven theory. For buyers, it is the difference between an AI system you can defend to an auditor and one you merely hope is right. Those are the same decision.

Frequently asked questions

What does it mean to treat SQL as a compiler target?

SQL stops being the model's final output and becomes the intermediate representation between business intent and execution, the way bytecode sits between Java source and the JVM. The semantic layer is the front-end compiler that turns intent into SQL; the database planner is the back-end compiler that turns SQL into a physical plan.

Why is text-to-SQL not a compiler?

It predicts likely tokens instead of compiling. It is not deterministic even at temperature 0 (Thinking Machines Lab got 80 different completions from 1,000 runs of one prompt), it cannot type-check against a verified schema and metric graph, and it fails silently by returning runnable but semantically wrong SQL.

How accurate is text-to-SQL on real enterprise databases?

On Spider 1.0's clean academic schemas, DIN-SQL with GPT-4 reaches 85.3% execution accuracy. On Spider 2.0, built from real enterprise databases, GPT-4o solves only 10.1% and o1-preview 17.1%. On the BIRD benchmark, GPT-4 reaches just 54.89% even with curated hints, against a human score of 92.96%.

How much more accurate is semantic-layer compilation than text-to-SQL?

dbt Labs' April 2026 benchmark scored text-to-SQL at 84-90%, while semantic-layer compilation scored 98.2% with Claude Sonnet 4.6 and 100.0% with GPT-5.3-codex on covered queries. Deterministic query generation means the model cannot produce an incorrect join or a bad aggregation.

Is compiled SQL fully deterministic?

Almost. Floating-point addition is not associative, so parallel sums over FLOAT columns can differ in the low-order digits. Use DECIMAL or NUMERIC for money and the problem largely disappears. That non-determinism is bounded and fixable; LLM non-determinism is unbounded across plausible-but-wrong queries.

What does a semantic compiler cost per query compared to a text-to-SQL agent?

Compilation runs in milliseconds and needs no LLM inference to build the SQL, so its marginal cost per query is effectively zero and compiled plans cache cleanly. Published costs for multi-step text-to-SQL put DIN-SQL at $0.7867 per query and DAIL-SQL at $0.1232 per query, each taking 12-16 seconds.