The Semantic Layer Evaluation Checklist: 40 Questions to Ask Before You Buy

Every vendor demo answers the questions the vendor chose. This checklist is the other set. Forty questions across seven dimensions, written to be pasted into an RFP or worked through on a vendor call - each one falsifiable, most with the follow-up that exposes a soft answer. It is the procurement-instrument companion to our Semantic Layer Buyer's Guide: that piece explains the market; this one is the clipboard you take into the meeting.

How to use this

Three rules make the checklist work. First, demand demonstrations, not assertions. "Yes, we have audit logs" is an assertion; "here is the audit record for this answer, now reproduce the answer from it" is a demonstration. Second, run every accuracy question against your schemas, not the vendor's demo dataset. The published record is unambiguous on why: frontier models score 86-91% on the clean academic Spider 1.0 benchmark and 10-21% on Spider 2.0's real enterprise tasks - the demo dataset is the easy half of the exam (the full cliff, with sources). Third, weight the dimensions for your risk profile before the first call, so a charismatic demo can't re-weight them for you. A regulated BFSI buyer might weight governance and audit at half the total; a startup analytics team might weight cost shape and time-to-value.

Disclosure, because it shapes the document: Colrows is our product, and we sell against this checklist. We have kept every question vendor-neutral - they are the questions we believe any serious buyer should ask everyone, including us.

Dimension 1: Who builds the semantic model? (Questions 1-6)

The semantic model is where natural-language accuracy comes from - and in most products, it is hand-built. This is the largest cost in the deployment and the least likely to appear in the quote.

  1. Who authors the initial semantic model - your software or my team? Follow-up: in your three most recent reference deployments, how many person-weeks did initial modeling take?
  2. What does the model ingest automatically? Warehouse schemas only, or also catalogs (Alation, Atlan, Collibra), BI metric stores (Power BI, dbt), and documentation (Confluence, wikis, PDFs)?
  3. How does the model represent relationships - joins, hierarchies, synonyms, business definitions? A flat list of metrics is not a semantic model; multi-hop questions need a graph.
  4. Can the model hold different meanings for different scopes? "Revenue" can legitimately differ between global finance and a regional sales team - ask how the product scopes definitions (Colrows's graph, for reference, is multi-scope: global → datastore → persona → user).
  5. Is the model versioned? Follow-up: show me a diff between two versions, and roll one back.
  6. What happens to questions the model cannot resolve? The two honest answers are "loud failure" or "best-effort guess, flagged." The dangerous answer is a confident guess, unflagged - the failure mode documented across probabilistic text-to-SQL.

Dimension 2: Who maintains it? (Questions 7-11)

Schemas change, metrics get redefined, sources get added. A semantic model that was right in March and unmaintained since is a liability that looks like an asset - we wrote up the decay mechanics in Knowledge Drift and Semantic Decay.

  1. How does the product detect that the warehouse has drifted from the model? Statistical drift detection and structural diffing, or a human noticing wrong numbers?
  2. When drift is detected, who fixes it? Autonomous maintenance with review, or a ticket to my data team?
  3. What is the vendor-recommended ongoing staffing for the semantic model? Get it in writing; this is the recurring line item that procurement data never captures.
  4. How are conflicting definitions resolved when two sources disagree?
  5. When a schema change breaks a definition, what do users see? An error naming the broken concept, or silently wrong numbers until someone audits?

Dimension 3: Accuracy and determinism (Questions 12-17)

This is the dimension to test, not discuss. Bring 30-50 questions with known correct answers against your own data.

  1. If I ask the same question twice, do I get the same SQL? The single highest-signal question in the document. Microsoft, for example, documents Power BI Copilot's outputs as nondeterministic; vendors whose architecture compiles rather than generates can answer yes flatly.
  2. Show me the exact query behind this answer. If the answer is a narrative without an inspectable query, see our list of tools that show their SQL for why this is disqualifying for audit-bound deployments.
  3. How are joins chosen - proven against known paths, or inferred per query? A join path proof is checkable; an inference is a probability.
  4. What is your measured accuracy, on what benchmark, and who ran it? Vendor-internal numbers (Snowflake's 90%+ on 150 internal questions, Databricks's 32→90% curation arc) are informative but not comparable; your POC is the benchmark that matters.
  5. In the POC, count loud failures separately from wrong answers. "I cannot resolve this" is a safety feature; a fluent wrong answer is the most expensive output the system can produce.
  6. Test ambiguity deliberately. Ask "what was revenue last quarter?" where two definitions of revenue exist, and watch whether the system asks, picks silently, or fails loudly.

Dimension 4: Governance (Questions 18-23)

The architectural question underneath all six: is policy enforced before the query runs (compile time) or after the data is fetched (render time)? The difference is whether a policy bug is an incident or a near-miss - argued in full in Governance as Code → Governance as Semantics.

  1. Where is access policy enforced - in the generated query plan, in the warehouse, or in the presentation layer?
  2. Does the product support RBAC, ABAC, and row/column-level predicates - and can one answer combine all three?
  3. Show me the same question asked by two users with different entitlements. The SQL itself should differ, not just the rendered rows.
  4. Can a policy change be applied without rebuilding semantic definitions?
  5. How do AI-agent queries inherit governance? If agents use a service account with broad read access, every agent answer is a potential exfiltration path.
  6. Whose credentials execute a shared query? Ask specifically - at least one major platform documents that shared AI spaces can run on the authoring user's credentials.

Dimension 5: Audit and reproducibility (Questions 24-28)

For regulated buyers this dimension is pass/fail. The bar - drawn from RBI, Federal Reserve, and EU AI Act expectations - is documented in Auditable SQL for Regulated Industries.

  1. For any historical answer, can you produce: the question, the resolved meaning, the exact SQL, the policies applied, and the data version?
  2. Is the answer point-in-time reproducible? Follow-up, the demonstration version: reproduce this answer from last month's audit record, now.
  3. Are the semantic definitions themselves versioned in the audit trail? "Revenue" may have been redefined since the answer was given; the trail must capture which definition applied.
  4. Can audit records be exported to my SIEM/GRC tooling?
  5. What does the auditor see for an AI-agent query versus a human query? The honest target is: no difference.

Dimension 6: AI-agent readiness (Questions 29-34)

If the roadmap includes agents - and every 2026 roadmap does - the semantic layer is the context they will stand on. A layer built only for dashboards will need replacing; the criteria are expanded in Best Semantic Layer for AI Agents.

  1. Is there a programmatic interface - API, MCP - that gives agents the same governed access humans get?
  2. Can an agent's query be constrained by the agent's own entitlements, distinct from its operator's?
  3. What context does the layer return besides data - definitions, lineage, confidence, provenance? Agents compose answers; they need the metadata, not just the rows.
  4. How does the layer handle multi-hop agent questions that no dashboard anticipated? Static metric stores stop here; graphs with proven join paths do not.
  5. What stops an agent from issuing a runaway query? Cost guards at compile time, or the warehouse bill three weeks later?
  6. Does agent traffic share the audit trail (see Q28) and the governance model (see Q22)? Circle back deliberately - vendors who answered those well for humans sometimes carve agents out.

Dimension 7: Cost shape (Questions 35-40)

Not "what does it cost" - "what shape does the cost take as we scale?" The published-price layer of this market is thin and the meters are creative; our teardowns of ThoughtSpot, Power BI Copilot, and Looker pricing supply the worked examples.

  1. What is metered - seats, queries, capacity, tokens, rows? Each meter taxes a different behaviour; know which behaviour of yours grows fastest.
  2. Is the AI feature gated behind a platform tier? Power BI Copilot requires paid Fabric capacity (F2 from $262.80/month) regardless of per-user licenses; Looker meters Gemini Data Tokens with overage billing from 1 October 2026.
  3. What happens when the meter runs out mid-month? Throttling, overage billing, or - as Fabric documents it - operations shutting down until the capacity recovers?
  4. What did implementation cost in your last three reference deployments? Enterprise BI procurement data puts implementation at 15-40% of first-year subscription value.
  5. What is the all-in modeling labour - initial build plus annual maintenance - at my schema scale? This is Dimension 1 and 2 priced in headcount; for most hand-modeled products it exceeds the license.
  6. What does the renewal look like? Ask for the median uplift across the current customer base, in writing.

The one-page scorecard

DimensionQuestionsPass looks likeWalk away if
1. Model construction1-6Autonomous build from warehouse + catalogs + docs; versioned, scoped graph"Your team authors the model" with no person-week estimate
2. Maintenance & drift7-11Drift detection + autonomous maintenance; loud breakageMaintenance is a ticket queue; breakage is silent
3. Accuracy & determinism12-17Same question → same SQL; inspectable queries; loud failureNondeterminism is documented and unmitigated
4. Governance18-23Compile-time RBAC + ABAC + row/column predicates; agents inherit policyPolicy lives in the presentation layer
5. Audit & reproducibility24-28Point-in-time reproducible answers; versioned definitions in the trail"We have logs" without a reproduction demo
6. AI-agent readiness29-34Programmatic governed access; multi-hop resolution; cost guardsAgents ride a broad-access service account
7. Cost shape35-40Meters you can forecast; modeling labour quoted in writingThe biggest line item (modeling) is "not our scope"

Weight before the first call. A suggested split for regulated buyers: dimensions 3-5 at 50%, 1-2 at 30%, 6-7 at 20%. For a growth-stage analytics team: 1-3 and 7 at 70%.

Where Colrows sits, in one paragraph

Since we wrote the checklist, here is our own entry, stated so you can test it rather than trust it. Colrows is a semantic execution layer: the semantic graph is built and maintained autonomously (dimensions 1-2 - drift detection, structural diffing, conflict resolution, no curation team), every question is compiled - intent → context resolution → constrained planning → governed execution - so the same question yields the same dialect-perfect SQL with a join path proof (dimension 3), governance is compile-time RBAC + ABAC + row/column-level predicates (dimension 4), every answer is point-in-time reproducible from the audit trail (dimension 5), and agents consume the same governed pipeline humans do (dimension 6). Cost shape: a free tier to run the POC on, Enterprise custom (dimension 7 - ask us questions 38-40; we expect them). The 40 questions above are how we would want to be evaluated.

Frequently asked questions

What are the most important evaluation criteria?

Seven dimensions: who builds the model, who maintains it, accuracy and determinism, governance, audit and reproducibility, AI-agent readiness, and cost shape. The weighting is yours; the questions are above.

What should go in the RFP?

The falsifiable subset: same-question-twice determinism (Q12), the reproduce-an-answer demonstration (Q25), reference-deployment modeling headcount (Q1, Q9, Q39), and the meter-exhaustion behaviour (Q37). Demand demonstrations over assertions throughout.

How do you test accuracy claims?

A POC on your schemas with 30-50 known-answer questions, each asked twice; count loud failures separately from wrong answers. The Spider 1.0 vs 2.0 gap (86-91% vs 10-21%) is the published proof that demo-dataset performance does not transfer.

What hidden costs recur across the market?

Modeling labour (build + maintain, rarely quoted), platform gates and meters (Fabric capacity, Gemini Data Tokens), and implementation services at 15-40% of first-year subscription value. Questions 35-40 surface all three.

A note on the claims

Benchmark figures (Spider 1.0/2.0), vendor caveats (Microsoft's nondeterminism documentation, Databricks's credential note, Fabric's exhaustion behaviour), and pricing facts (Fabric F2, Gemini Data Token overage dates, implementation percentages) are sourced in the linked deep-dives, all checked 11-12 June 2026. Colrows is our product, and the disclosure above applies to the whole document: these are the questions we ask buyers to ask everyone.

Bring the checklist. We'll bring the demonstrations.