The Hook: Three Places Authorization Can Live, and Only One Is Early Enough
Authorization in a data estate can sit in one of three places, and the choice quietly determines your blast radius. Put it in the database, and it is fast and native but trapped inside one engine. Put it in the BI layer, and it works for human analysts in a dashboard but executes after the warehouse has already read the data — and any agent or SQL client walks straight past it. Put it in the semantic layer, between your data sources and every consumer, and it is evaluated at compile time, before a single row is read.
The stakes are not theoretical. The 2017 Equifax breach — which exposed data on roughly 147 million people — led to a settlement of "at least $575 million, and potentially up to $700 million," with "up to $425 million in monetary relief to consumers, a $100 million civil money penalty." The breach was rooted in access-control and patching failures. The 2024 Snowflake customer-credential campaign (UNC5537) is the more modern lesson: Mandiant's report states "approximately 165 potentially exposed organizations," that "79.7% of affected accounts had prior credential exposure," and "impacted accounts were not configured with multi-factor authentication enabled." Attackers used stolen credentials to authenticate directly to customer tenants and extracted data using normal database operations; in late May 2024, UNC5537 began advertising data from Ticketmaster and Santander for sale. No platform vulnerability was exploited.
The common thread is the trusted-intermediary problem: once a query runs under a credential the database trusts, the database has no idea who the real user is or whether they should see the result. Broken access control is, per OWASP Top 10:2021, the #1 web application security risk: its A01 entry reports that "94% of applications were tested for some form of broken access control with the average incidence rate of 3.81%."
Section 1 — The Layered Stack and Where Authorization Has Lived
The traditional analytics stack is storage (databases/warehouses) → transformation (dbt, SQL) → BI/presentation (Looker, Tableau, dashboards). The modern, AI-era stack inserts a new tier: storage → semantic layer → agents / APIs / BI.
Historically authorization has lived in one of three spots:
- Database-native: Snowflake masking, BigQuery column-level security, Postgres RLS.
- BI-layer: Looker access filters / user attributes, Tableau RLS.
- Ad hoc SQL filters baked into transformation models.
Each has a fatal flaw on its own. Database-native controls cannot reach across the warehouse boundary or out to agents and APIs. BI-layer controls fire too late and are bypassed by anything not going through the BI tool. Ad hoc SQL filters are invisible to audit — if the filter has a bug or is omitted, nothing flags it. A semantic layer enforces authorization at compile time — before any data is touched — sitting "between users, AI agents, and your data systems," resolving meaning at compile time and running "dialect-perfect SQL inside a policy-shaped subgraph."
Section 2 — The Database-Layer Limitation
Database-native enforcement is excellent at what it does — and what it does is bounded by one engine.
- Snowflake Dynamic Data Masking rewrites column values at query time based on role. It is powerful within Snowflake, but it is a Snowflake-only construct requiring Enterprise Edition. It cannot mask data that lives in Databricks or BigQuery. Critically, masking did not protect customers in the 2024 credential attacks because attackers held credentials that saw unmasked production data.
- BigQuery column-level security uses Data Catalog policy tags and is scoped to BigQuery/GCP. It does not reach a Snowflake or Postgres table.
- PostgreSQL RLS injects row predicates within Postgres only, with limited ABAC expressiveness.
In a multi-source estate the consequences compound. Policies must be authored separately in each system. When a user is terminated, access must be revoked in Snowflake and Databricks and BigQuery — miss one and the user retains access. The audit question "who accessed customer data across the estate?" requires querying three different audit logs and reconciling them by hand.
Section 3 — The BI-Layer Trap: Why Authorization Is Structurally Too Late
Here is how BI-layer authorization actually works. A user opens a dashboard. The BI tool generates SQL and runs it against the warehouse — typically under a shared service credential, not the end user's identity. The warehouse executes the query, reads the data, and returns results. Then the BI tool applies row-level security: filtering rows, masking columns, applying access filters.
The architectural problems follow directly:
- The data is already read. By the time the BI tool filters, the warehouse has scanned and returned the rows. The exposure has already happened on the wire and in the warehouse's memory and logs.
- Identity is lost. The warehouse logs "query executed by
looker_service_account," not "query executed to answer a question for User X." This is the confused-deputy problem: the warehouse can't enforce per-user policy because it never sees the user. - It is bypassable. Google's Looker documentation is explicit: "If users have access to raw SQL (through SQL Runner or the API), they can bypass Explore access filters." If your AI agent queries the database directly, every row-level security policy in your Looker instance is silently bypassed.
- It is a performance tax. Because filtering happens after the read, the warehouse cannot push authorization predicates down to prune data it should never have touched.
- It is hard to audit. If the BI-layer masking has a bug or is inconsistently applied, the warehouse audit log won't show it — the query just looks like a normal service-account read.
The canonical failure: an analyst sees correctly masked revenue-by-region in a dashboard, then opens a SQL client or API against the same warehouse and sees global revenue unmasked. The BI-layer control simply did not apply.
Section 4 — Why AI Agents Broke the BI-Layer Model
The BI layer assumed a human analyst with training, judgment, and a tool in front of them. AI agents violate every part of that assumption. They query the warehouse programmatically — increasingly over the Model Context Protocol (MCP) — with no BI layer in the path to apply masking. Agents reason directly over the raw schema, inferring relationships that are implicit and undocumented.
This is dangerous for two compounding reasons. First, agents are unreliable on raw enterprise schemas. The BEAVER benchmark found that GPT-4o and Llama3-70B-Instruct "achieved close to 0 end-to-end execution accuracy," concluding that "off-the-shelf LLMs trained on public datasets are unable to generalize." Spider 2.0 tells the same story: "even the advanced LLMs—o1-preview solve only 17.1% of Spider 2.0 tasks. For widely-used models like GPT-4o, the success rate is only 10.1%." Agents hallucinate non-existent tables, invent joins, and omit filters.
Second, and worse for security: if authorization lives only at the BI layer, the agent has no authorization constraint at all. It reaches the warehouse under a service identity and reads whatever that identity can see. Microsoft's Fabric Data Agent has documented exactly this — RLS configured only in the semantic model was not enforced when the agent queried under its service identity, "resulting in all data being returned."
Section 5 — The Semantic-Layer Solution
A semantic layer is an explicit, typed model of meaning — entities, relationships, metrics, constraints, and access policies — sitting above storage and below every consumer. The defining property for authorization is when enforcement happens: during query planning, before any warehouse query is generated.
The pipeline:
- An agent (or BI tool, or API caller) submits intent — natural language or structured.
- The semantic layer resolves that intent against the graph: which entities, relationships, and metrics does it require?
- Authorization is checked. Personas, scopes, and policy nodes shape an allowed subgraph before any plan is generated. If a metric depends on a node outside that subgraph, "resolution fails - there is no query to run."
- If authorized, dialect-perfect SQL is compiled and executed; if not, compilation fails with a clear reason.
Contrast with post-hoc enforcement: the BI layer reads then filters; the semantic layer fails during planning, so the unauthorized read never happens. Enforcement is "structural rather than procedural." Every compiled query produces a trace capturing "graph version, identity context, resolved entities, proven join paths, and compiled SQL," enabling point-in-time reproducibility. Gartner elevated the semantic layer to essential infrastructure, stating: "By 2030, universal semantic layers will be treated as critical infrastructure, alongside data platforms and cybersecurity."
Section 6 — Architectural Comparison
| Dimension | Database-native | BI-layer | Semantic-layer |
|---|---|---|---|
| Scope | Single warehouse only | Single BI tool's users | Multi-warehouse (16+ engines) |
| Enforcement latency | At query execution (data read) | After read, before display | At compile time (before read) |
| Agent-ready? | Partial (if agent uses warehouse identity) | No — agents bypass entirely | Yes — agents query the governed graph |
| Cross-warehouse consistency | No — re-authored per engine | No | Yes — one policy set |
| Audit | Per-engine logs, service-account noise | Hard to prove masking applied | Reproducible per-query trace |
| Performance | Native, fast within engine | Tax: reads then filters | Predicate pushdown of authorized subgraph |
The honest trade-off: database-native is the fastest and simplest within one engine; semantic-layer enforcement requires building and maintaining a semantic model, but it is the only option that is simultaneously multi-warehouse, agent-ready, and audit-ready.
Section 7 — Multi-Source Enforcement and the Revocation Problem
Consider a user with access to revenue in Snowflake but not in Databricks. Under database-native enforcement you maintain two separate policies in two systems. When the user is terminated, you must revoke in both — and in BigQuery, and anywhere else. Miss one and the breach is latent. The audit question "show me everyone who accessed customer data last quarter" becomes a multi-day reconciliation across three audit logs with three schemas. The semantic-layer answer collapses this to one policy, one revocation, one audit trail, enforced identically before any engine is touched. Fine-grained data access control mechanisms (RBAC, ABAC, row/column predicates) are what the semantic layer compiles; the layer is where those mechanisms are evaluated.
Section 8 — The "Authorization at the API Layer" Trap
A common intermediate pattern is to build REST APIs and embed authorization checks in API code. It does not scale: every API is a separate authorization implementation, policies diverge across services, testing is hard, and enforcement is opaque to the user. Agents become "a confused-deputy problem at industrial scale" when a single service identity serves many users. The semantic-layer alternative makes authorization declarative and centralized — one policy model, evaluated structurally — rather than scattered across hand-written endpoint code.
Section 9 — Build vs Buy for Semantic-Layer Enforcement
You can build this. The canonical DIY stack is Open Policy Agent (OPA) + custom SQL templating + a query parser: OPA decides allow/deny and can emit row filters as WHERE-clause fragments, while your own code handles parsing, join graphs, and dialect generation.
The expensive part is everything around the policy decision. The hidden cost of building your own data access layer is a from-scratch data access layer pegged at "easily 6-12 months of work by senior engineers" just to hit baseline parity, with a representative in-house TCO of two backend engineers for 12 months at $300,000+, DevOps at $100,000+, and ~$50,000/year ongoing maintenance. The recurring tax is dialect handling (Snowflake SQL vs Postgres vs BigQuery), join-path correctness, and keeping policies valid as schemas drift.
Decision rule: build if you have genuinely unique policy semantics and the standing engineering capacity to own a security-critical system indefinitely. Buy if you want multi-dialect support, audit trails, and semantic maintenance delegated, and you want to ship in weeks. Colrows reports datasource connection and initial graph build in "hours, not weeks," and production rollouts in regulated environments "in weeks, not months."
Section 10 — Competitive Landscape
- Snowflake / BigQuery / Postgres: database-native masking and RLS — fast, native, single-engine; do not reach agents/APIs or other warehouses.
- Looker / Tableau: BI-layer RLS — works for human analysts in-tool; late and bypassable.
- dbt: a transformation/metrics layer. Its semantic layer standardizes metric definitions and respects the data-platform credential's policies, but its scope "covers metric calculation but not broader entity definitions, access governance, or AI-agent APIs."
- Atlan / Collibra / Alation: governance and metadata catalogs. They document ownership, lineage, and classification, but policy enforcement "depend[s] on manual stewardship."
- Colrows: semantic execution layer — multi-warehouse, compile-time enforcement, per-query audit trace.
- Custom OPA + SQL: maximum control, maximum maintenance.
Section 11 — CTO Decision Framework
- Single warehouse, human analysts only: database-native (plus BI-layer for convenience) is acceptable.
- Two or more warehouses: per-warehouse enforcement becomes untenable — drift and revocation risk grow with each engine. Evaluate semantic-layer enforcement.
- AI agents in production: BI-layer authorization is bypassed by design. Semantic-layer enforcement becomes effectively mandatory.
- Regulatory audit intensity (HIPAA, GDPR, financial services): compile-time, reproducible audit trails move from nice-to-have to required.
Thresholds that change the recommendation: the moment you cross to 2+ warehouses or put any agent in production against real data, move authorization to the semantic layer. If you remain single-engine and human-only, you can defer.
Section 12 — Implementation Roadmap
- Greenfield: design the semantic layer from day one; point all consumers (BI, APIs, agents) at it.
- Brownfield: add the semantic layer as an overlay. Repoint agents and APIs to it first (they are the highest-risk, lowest-friction path), keep BI tools connected to the warehouse for backward compatibility, then migrate dashboards onto the governed layer.
- Hybrid: keep performance-critical, low-sensitivity queries database-native; route governed/audited/cross-warehouse queries through the semantic layer.
