A flat list of metric definitions transforming into a living concept graph with lineage, context, behavior, and policy orbits.

From Metric Stores to Knowledge Machines

For the last decade, the industry believed it had solved the analytics problem. Metrics were centralized. Definitions were standardized. Dashboards were unified. Metric stores emerged as the "single source of truth."

And yet, the same questions keep coming back. Why does this number behave differently in different contexts? What changed upstream? Which downstream metrics are affected? Can I trust this figure for this decision?

The uncomfortable truth: metric stores solved storage, not understanding. They standardized how a number is calculated. They never captured what it means. That gap was tolerable when humans read dashboards. It is fatal now that agents query data directly.

What a metric store actually is

A metric store, or semantic layer, lets you define a business metric once, in code, and serve it consistently to every tool. The category converged fast. dbt's Semantic Layer is powered by MetricFlow, which dbt Labs acquired from Transform in February 2023 and which compiles YAML semantic models into SQL at query time. Cube is an open-source headless BI layer that exposes metrics through SQL, REST, and GraphQL. Looker encodes metrics in LookML. Snowflake ships Semantic Views that Cortex Analyst reads. Databricks ships Unity Catalog Metric Views.

They all have the same shape. Define a measure once. Separate the measure from the dimensions you slice it by. Let an engine generate the join and the aggregation at runtime. This was real progress. It killed a lot of duplicate SQL and gave teams one place to change a definition. But it was built on a narrow assumption: store the formula centrally and the rest takes care of itself.
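
To make that shared shape concrete, here is a minimal sketch in Python. The Metric class and compile_metric function are hypothetical names chosen for illustration, not any vendor's API; the point is only that the measure is defined once and the grouping SQL is generated at request time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric definition: one measure, defined once."""
    name: str
    table: str
    expression: str  # aggregation over columns of `table`


def compile_metric(metric: Metric, dimensions: list[str]) -> str:
    """Generate aggregation SQL for the metric, sliced by the given dimensions."""
    select_cols = ", ".join(dimensions + [f"{metric.expression} AS {metric.name}"])
    sql = f"SELECT {select_cols} FROM {metric.table}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql


revenue = Metric(name="revenue", table="orders", expression="SUM(amount)")
print(compile_metric(revenue, ["region", "order_month"]))
```

Note what the sketch does not contain: nothing in Metric says why the measure exists, when it should not be used, or what it affects downstream.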

Where the paradigm hits the wall

It does not take care of itself. A metric store knows how "revenue" is calculated and which column it reads. It does not know why the metric exists, when it should not be used, how it relates to business events, what it impacts downstream, or how its meaning shifts across finance, sales, and growth. "Revenue" means something different before refunds versus after, gross versus net, booked versus recognized.

The deeper problem is structural, and it shows up as four costs.

Metric stores standardize arithmetic, not meaning. They ensure everyone computes "revenue" the same way. They do not ensure everyone means the same thing by it. Teams "can agree on how a number is calculated while still disagreeing on what that number actually represents." Centralized formulas on top of inconsistent entities just produce consistent answers to the wrong question.

They are hand-authored and slow to evolve. Every new metric is a ticket. MetricFlow uses a single global namespace, so two domains cannot both define total_sales without a parse error. LookML is powerful but heavy: when warehouse schemas change more than twice a month, "LookML maintenance becomes a full-time job." Per Gartner analysis, organizations spend 40-60% of their total Looker investment on LookML development and maintenance.

Coordination cost compounds. Without central ownership of meaning, definitions diverge by team. A product team's "active users" (30-day window) reports 50,000 while marketing's (90-day window) reports 120,000, and leadership cannot tell whether adoption is growing or shrinking. Every new ungoverned dashboard adds another conflicting number.

The artifacts go unused. Self-service BI adoption frequently stalls below expectations, and a large share of users abandon dashboards for spreadsheets because the numbers do not reconcile. The metric store was supposed to fix trust. By itself, it does not.

Why this breaks now: agents do not read dashboards

A static metric registry could limp along when a human was the consumer. The human knew which metric to pick, applied judgment, and noticed when a number looked wrong. Agents do none of that.

And agents are arriving fast. Gartner predicts that 40% of enterprise applications will be integrated with task-specific AI agents by the end of 2026, up from less than 5% in 2025. The problem is that language models are terrible at enterprise SQL. On the academic Spider 1.0 benchmark, GPT-4o scores 86.6%. On Spider 2.0, built from real enterprise schemas, the same model's success rate collapses to 10.1%, and even o1-preview solves only 17.1% of tasks. The failure mode is not syntax. It is semantics: the model does not know your schema, hallucinates joins, and misuses metrics.

That produces a uniquely dangerous outcome. A wrong dashboard is noticed eventually. A wrong agent answer arrives confidently, at scale, with an authoritative-sounding explanation. You get automated confident confusion.

Gartner has now put numbers on the fix. The firm predicts that by 2027, organizations that prioritize semantics in AI-ready data will increase their agentic AI accuracy by up to 80% and reduce costs by up to 60%. "Context with semantic coherence will become a cost-control and trust strategy, not a nice-to-have." Gartner advises establishing a context layer as a core component of data infrastructure, because traditional schema-based models "no longer suffice."

What a knowledge machine is

A knowledge machine does not store definitions. It reasons over them. Instead of answering "what is the SQL for this metric," it answers: is this metric appropriate for this question; which version applies in this context; what events could explain this change; what assumptions are baked in.

The substrate is a typed semantic graph, not a metric registry. Nodes are entities, metrics, events, concepts, constraints, and policies. Edges are typed relationships: ownership, dependency, causality. Unlike a property graph optimized for traversal, a knowledge graph prioritizes meaning, consistency, and inference, and it can derive implicit facts from explicit ones using an ontology as a schema layer. Versioning is bitemporal, tracking when a fact was true in the world and when the system recorded it, which gives point-in-time reproducibility.
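
A bitemporal edge can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not a description of any product's storage format: the Edge fields, the as_of function, and the sample table names are all hypothetical. Each edge carries a valid-time interval (when the fact held in the world) and a record-time interval (when the system believed it).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str                  # typed relationship, e.g. "depends_on"
    target: str
    valid_from: date               # when the fact became true in the world
    valid_to: Optional[date]       # None means still true
    recorded_at: date              # when the system recorded this belief
    superseded_at: Optional[date]  # when a later record replaced it


def as_of(edges, valid_on: date, known_by: date):
    """Edges true in the world on `valid_on`, as the system believed on `known_by`."""
    return [
        e for e in edges
        if e.recorded_at <= known_by
        and (e.superseded_at is None or known_by < e.superseded_at)
        and e.valid_from <= valid_on
        and (e.valid_to is None or valid_on < e.valid_to)
    ]


# Hypothetical history: on 2024-07-15 the system learns that net_revenue
# switched source columns on 2024-07-01, and closes out the old belief.
EDGES = [
    Edge("net_revenue", "depends_on", "orders.amount",
         date(2024, 1, 1), None, date(2024, 1, 1), date(2024, 7, 15)),
    Edge("net_revenue", "depends_on", "orders.amount",
         date(2024, 1, 1), date(2024, 7, 1), date(2024, 7, 15), None),
    Edge("net_revenue", "depends_on", "orders.net_amount",
         date(2024, 7, 1), None, date(2024, 7, 15), None),
]

# What did the system believe on 10 July about 1 August? Still the old column.
print([e.target for e in as_of(EDGES, date(2024, 8, 1), date(2024, 7, 10))])
```

Asking the same question with a later known_by date returns the corrected dependency, which is what makes a past answer reproducible rather than silently rewritten.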

This matters for AI accuracy, not just elegance. Grounding model reasoning in a knowledge graph consistently beats ungrounded baselines. In the widely cited data.world study by Sequeda and colleagues, GPT-4 writing zero-shot SQL against a raw enterprise database scored 16.7% overall execution accuracy, while the same model querying through a knowledge graph representation reached 54.2%—an improvement of 37.5 points, roughly triple the accuracy. Structure is what turns "sounds right" into "is right."

How agents consume the graph

The mechanics are where a knowledge machine separates from a metric store. A metric store answers "give me metric X grouped by Y." A knowledge machine answers "show me revenue impact from failed payments last week and explain what changed." One assumes you already know what to ask. The other helps you discover what matters.

The execution path is deterministic and runs in one order: resolve intent, prove the joins, enforce policy, then execute.

Resolution. When intent arrives in free text, every term is resolved to a grounded node. "Revenue" is not a string; it is a concept with a specific formula, source, and governance scope. Grounding each concept with multiple embedding vectors—one for the formal definition, one for observed usage, one combined—ensures that "real revenue" maps to the existing "net revenue after adjustments" rather than silently spawning a new dialect.
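
A toy version of multi-vector resolution might look like the following. Everything here is an assumption made for illustration: the three-dimensional vectors stand in for real embeddings, the concept names are invented, and scoring a concept by its best-matching vector with a fixed threshold is one plausible rule, not a documented one.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical concepts, each grounded by three toy embedding vectors.
CONCEPTS = {
    "net_revenue_after_adjustments": {
        "definition": [0.9, 0.1, 0.0],
        "usage": [0.7, 0.3, 0.1],
        "combined": [0.8, 0.2, 0.05],
    },
    "active_customers": {
        "definition": [0.0, 1.0, 0.1],
        "usage": [0.1, 0.9, 0.2],
        "combined": [0.05, 0.95, 0.15],
    },
}


def resolve(term_vector, concepts, threshold=0.8):
    """Return the best-matching concept name, or None if nothing is close enough."""
    best_name, best_score = None, -1.0
    for name, vectors in concepts.items():
        # Assumed scoring rule: a concept is as close as its nearest vector.
        score = max(cosine(term_vector, v) for v in vectors.values())
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# A phrase like "real revenue" would embed near observed usage of net revenue.
print(resolve([0.75, 0.3, 0.1], CONCEPTS))
```

Returning None below the threshold is the important branch: an unmatched term is surfaced for review instead of quietly becoming a new concept.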

Join-path proof. This is the sharpest break from text-to-SQL. Rather than letting a model guess a join, the planner treats the join graph as typed edges (cardinality, foreign keys, grain compatibility) and runs a bounded search to prove a deterministic path exists. In the Colrows design, candidate paths are enumerated by breadth-first search up to a hop limit (default four), pruned by grain and cardinality constraints, then ranked; if no valid path exists, or the top path is not unique, compilation fails rather than emitting a wrong number. LLMs hallucinate joins; constrained planning cannot.
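
The bounded search described above can be sketched as follows. This is a simplified illustration, not the Colrows implementation: the edge list, the rule of pruning one-to-many (fan-out) joins, and ranking by hop count are assumptions made to keep the example small.

```python
from collections import deque


class JoinPlanError(Exception):
    """Raised when no unique, valid join path can be proven."""


# Hypothetical typed join edges: (from_table, to_table, cardinality).
EDGES = [
    ("payments", "orders", "many_to_one"),
    ("orders", "customers", "many_to_one"),
    ("orders", "line_items", "one_to_many"),
    ("line_items", "products", "many_to_one"),
]


def prove_join_path(edges, start, goal, max_hops=4):
    """Breadth-first enumeration of join paths, bounded by max_hops."""
    adjacency = {}
    for left, right, cardinality in edges:
        adjacency.setdefault(left, []).append((right, cardinality))

    candidates = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            candidates.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop limit reached
        for nxt, cardinality in adjacency.get(path[-1], []):
            if cardinality == "one_to_many":
                continue  # assumed pruning rule: fan-out would change the grain
            if nxt in path:
                continue  # no cycles
            queue.append(path + [nxt])

    if not candidates:
        raise JoinPlanError(f"no valid join path from {start} to {goal}")
    candidates.sort(key=len)  # rank: fewest hops first
    if len(candidates) > 1 and len(candidates[0]) == len(candidates[1]):
        raise JoinPlanError(f"ambiguous join path from {start} to {goal}")
    return candidates[0]


print(prove_join_path(EDGES, "payments", "customers"))
```

Asking for a path from payments to products raises JoinPlanError in this sketch, because the only route fans out through line_items. The failure is the design choice: an error is cheaper than a double-counted total.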

Compile-time governance. RBAC, ABAC, and row- and column-level predicates are injected into the SQL before any byte is fetched. Unauthorized plans are never generated, rather than being filtered after the fact.
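
As a rough sketch of what injecting policy at compile time means, consider the following. The policy table, role name, and compile_governed function are hypothetical, and a production compiler would build a parsed query plan rather than concatenate strings.

```python
class PolicyError(Exception):
    """Raised when a requested plan is not authorized for the caller."""


# Hypothetical policy: allowed columns and a row-level predicate per role.
POLICIES = {
    "emea_analyst": {
        "columns": {"region", "amount", "order_date"},
        "row_filter": "region = 'EMEA'",
    },
}


def compile_governed(table, columns, role, policies):
    """Emit SQL only if every column is allowed; always attach the row filter."""
    policy = policies.get(role)
    if policy is None:
        raise PolicyError(f"no policy for role {role!r}")
    denied = [c for c in columns if c not in policy["columns"]]
    if denied:
        # The unauthorized plan is never generated, so nothing is fetched.
        raise PolicyError(f"columns not permitted for {role!r}: {denied}")
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {policy['row_filter']}"


print(compile_governed("orders", ["region", "amount"], "emea_analyst", POLICIES))
```

A request that includes a column outside the role's allow-list fails before any SQL string exists, which is the difference from filtering results after execution.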

Determinism. Identical intent, identical scope, and identical graph version produce identical SQL every time. Two agents asking the same question in the same scope get the same answer. RAG offers no such guarantee.
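
One way to picture that guarantee is as a pure function of three inputs. The sketch below is an assumption-laden illustration: compilation_key is a hypothetical helper that fingerprints the intent, scope, and graph version, so that anything keyed on it (a cache, an audit log) sees identical inputs as identical.

```python
import hashlib
import json


def compilation_key(intent: dict, scope: dict, graph_version: str) -> str:
    """Stable fingerprint of everything the compiled SQL is allowed to depend on."""
    canonical = json.dumps(
        {"intent": intent, "scope": scope, "graph_version": graph_version},
        sort_keys=True,            # key order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = compilation_key({"metric": "net_revenue", "group_by": ["region"]}, {"role": "finance"}, "v42")
b = compilation_key({"group_by": ["region"], "metric": "net_revenue"}, {"role": "finance"}, "v42")
c = compilation_key({"metric": "net_revenue", "group_by": ["region"]}, {"role": "finance"}, "v43")
print(a == b, a == c)  # prints: True False
```

The same intent against a newer graph version yields a different key, which is what lets a changed definition change the answer without making the system non-deterministic.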

The discipline pays for itself. Colrows reports that it rejects, on average, 23% of AI-generated queries at the semantic compilation stage—queries that would otherwise burn warehouse compute and return wrong results. Refusing to answer is a feature.

Multi-agent reasoning and the feedback loop

Single answers are table stakes. The harder requirement is a shared, governed memory that multiple agents reason over and that improves with use. An agentic analytics architecture has a reasoning layer, a semantic layer, an action layer, and a feedback loop that captures outcomes so the system continuously learns.

Colrows operationalizes that loop as "emergent consensus." A new pattern is observed: say, that payment failures often precede churn. An inference agent proposes the relationship PaymentFailed -> ChurnRisk. A validation step checks whether the correlation is statistically significant and whether it conflicts with existing logic. A governance step applies scope rules, for example allowing it for internal analytics but not external reporting. A version is created. As more queries depend on it, confidence rises and the relationship becomes canonical. No one filed a modeling ticket.
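
The lifecycle reads naturally as a small state machine. The sketch below is a hypothetical rendering of those steps, not Colrows code: the Proposal fields, the 0.05 significance threshold, and the usage count at which a relationship becomes canonical are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    source: str
    target: str
    p_value: float                                 # from an upstream correlation test
    conflicts: list = field(default_factory=list)  # clashes with existing logic
    allowed_scopes: set = field(default_factory=set)
    version: int = 0
    supporting_queries: int = 0
    status: str = "proposed"


def validate(p: Proposal, alpha: float = 0.05) -> Proposal:
    """Accept only statistically significant, non-conflicting relationships."""
    p.status = "validated" if p.p_value < alpha and not p.conflicts else "rejected"
    return p


def govern(p: Proposal, scopes) -> Proposal:
    """Apply scope rules and cut the first version."""
    if p.status == "validated":
        p.allowed_scopes = set(scopes)
        p.version = 1
        p.status = "versioned"
    return p


def record_use(p: Proposal, canonical_after: int = 100) -> Proposal:
    """Each dependent query raises confidence; enough of them make it canonical."""
    if p.status == "versioned":
        p.supporting_queries += 1
        if p.supporting_queries >= canonical_after:
            p.status = "canonical"
    return p


proposal = govern(validate(Proposal("PaymentFailed", "ChurnRisk", p_value=0.01)),
                  scopes=["internal_analytics"])
for _ in range(3):
    record_use(proposal, canonical_after=3)
print(proposal.status, proposal.allowed_scopes)
```

A proposal that fails validation stays rejected through every later step, so usage alone can never promote an unvalidated relationship.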

The graph is built the same way it is maintained: by observation, not authoring. Colrows ingests database schemas and DDL; dbt models; catalog exports; existing semantic layers (dbt YAML, LookML, Power BI); business glossaries and wikis, processed through an NLP pipeline; and historical query logs, from which it learns how different personas phrase the same intent. Over time this becomes an enterprise-specific semantic fingerprint that raises both accuracy and switching cost.

Then it watches for decay. An autonomous subsystem monitors five categories of drift: schema (columns added or removed), distribution (value distributions shift), definition (an "active customer" threshold changes from 30 to 90 days), relationship (a foreign key is dropped), and policy (access rules change). Breaking changes invalidate the dependent paths and route a proposed fix for human approval in-product. "Data doesn't rot on its own. Meaning does." The job is to make change visible before drift becomes debt.
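
Three of the five drift categories can be illustrated by diffing two snapshots. This is a deliberately small sketch with hypothetical snapshot fields; distribution and policy drift are omitted because they need statistics and access-rule models that do not fit in a few lines.

```python
def detect_drift(old: dict, new: dict) -> list:
    """Compare two metadata snapshots and list (category, detail) findings."""
    findings = []
    for col in sorted(new["columns"] - old["columns"]):
        findings.append(("schema", f"column added: {col}"))
    for col in sorted(old["columns"] - new["columns"]):
        findings.append(("schema", f"column removed: {col}"))
    for fk in sorted(old["foreign_keys"] - new["foreign_keys"]):
        findings.append(("relationship", f"foreign key dropped: {fk}"))
    for name, value in sorted(new["definitions"].items()):
        previous = old["definitions"].get(name)
        if previous is not None and previous != value:
            findings.append(("definition", f"{name}: {previous} -> {value}"))
    return findings


# Hypothetical snapshots taken before and after an upstream change.
before = {
    "columns": {"id", "amount", "region"},
    "foreign_keys": {"orders.customer_id->customers.id"},
    "definitions": {"active_customer_window_days": 30},
}
after = {
    "columns": {"id", "amount", "currency"},
    "foreign_keys": set(),
    "definitions": {"active_customer_window_days": 90},
}

for category, detail in detect_drift(before, after):
    print(category, "|", detail)
```

In a real system each finding would be mapped to the join paths and metrics that depend on it; here the output is just the list a reviewer would be asked to approve.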

Every query, finally, is its own audit record. It captures the graph version, identity context, resolved entities, proven join paths, and the exact compiled SQL, so a historical query can be replayed with the definitions in force at that moment. In a regulated environment where audits arrive months later, the runtime is the audit.
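
A minimal sketch of such a record, and of replaying it, follows. The AuditRecord fields mirror the list above, but the class itself, the DEFINITIONS table, and the toy compile_sql function are hypothetical stand-ins for a real compiler.

```python
from dataclasses import dataclass

# Hypothetical metric definitions, keyed by graph version.
DEFINITIONS = {
    "v41": {"revenue": "SUM(amount)"},
    "v42": {"revenue": "SUM(amount - refunds)"},
}


@dataclass(frozen=True)
class AuditRecord:
    graph_version: str
    identity: str
    resolved_entities: tuple
    join_path: tuple
    compiled_sql: str


def compile_sql(graph_version, metric, join_path):
    """Toy compiler: look up the definition in force for this graph version."""
    expression = DEFINITIONS[graph_version][metric]
    return f"SELECT {expression} AS {metric} FROM {join_path[0]}"


def run_and_record(identity, metric, join_path, graph_version):
    """Compile a query and capture everything needed to reproduce it."""
    sql = compile_sql(graph_version, metric, join_path)
    return AuditRecord(graph_version, identity, (metric,), tuple(join_path), sql)


def replay(record: AuditRecord) -> str:
    """Recompile with the definitions in force when the query originally ran."""
    return compile_sql(record.graph_version, record.resolved_entities[0], record.join_path)


record = run_and_record("analyst@example.com", "revenue", ["orders"], "v41")
print(replay(record) == record.compiled_sql)  # prints: True
```

Because the record pins the graph version, the replay reproduces the original SQL even though a newer definition of revenue now exists.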

The competitive landscape, honestly

The category is real and the incumbents are good at what they do. The distinction is what layer they serve.

dbt Semantic Layer (MetricFlow) keeps meaning versioned alongside transformations and compiles metric requests to SQL. It is the natural fit if your logic already lives in dbt. It is hand-authored YAML, refreshed by humans, and exposes a metric API, not a reasoning graph.

Cube is excellent plumbing: an open, headless metrics API for humans and embedded apps, with mature pre-aggregation and caching. Cube is now shipping an agentic interface ("Cube D3"), a sign the whole category is moving toward agents. Its model is metrics-as-API, with manual schema authoring and governance delegated to the warehouse.

Looker is presentation-time semantics. A user picks dimensions and clicks Run, and LookML resolves the click into SQL. It is strong governance for human dashboards, but LookML is a proprietary language with a heavy maintenance burden, and its governance does not extend to agents querying through other channels.

Snowflake Semantic Views and Databricks Metric Views are warehouse-native and well integrated, but they stop at the warehouse boundary. An enterprise semantic layer has to span an estate that includes more than one warehouse.

The industry itself is conceding that definitions must be shared and portable. In September 2025 Snowflake launched the Open Semantic Interchange (OSI) with Salesforce, dbt Labs, BlackRock, Cube, and others to standardize a vendor-neutral semantic specification, and dbt Labs open-sourced MetricFlow under Apache 2.0. That is the right direction. But a portable definition format standardizes the noun. It does not give you the verb: deterministic, governed compilation of agent intent.

Where Colrows fits

Colrows is a semantic execution layer, not a metric store. It autonomously builds a typed, versioned semantic graph across the data estate, then compiles every agent query through it into governed, deterministic, dialect-perfect SQL, with joins proven, policy injected at compile time, and a reproducible audit trace. It compiles to optimized SQL for 16-plus engines including Snowflake, Databricks, Redshift, BigQuery, and Postgres. It is used by engineering teams at Pfizer, Cipla, BTS Group, Flobiz, and Brexa.

The honest framing: Colrows is not trying to be your dashboarding tool, and it is complementary to several incumbents at the data layer. Many enterprises run Colrows as the agent-execution layer while keeping Looker for human dashboards or dbt for transformations. The divergence is at the application layer. Where a metric store stops at metric values, Colrows operates on meaning. Where a metric store is hand-authored, Colrows self-maintains. Where text-to-SQL guesses a join, Colrows proves it or refuses.

A fair caveat for buyers: the performance figures here (sub-100ms p99 resolution, 0.91 first-result accuracy on benchmark datasets, 23% query rejection) are Colrows' own reported numbers, not independently audited, and the autonomous-graph category is young. Gartner's own counsel applies: agentic analytics can be overkill where compliance is heavy, ROI is low, or data integration is immature, and Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. The right move is a scoped pilot with measurable accuracy and governance benchmarks before you scale.

The strategic case

Strip away the architecture and the decision is simple. The metric store was built for a world where humans read curated dashboards. That world is ending. The new consumer is an agent that generates intent in language and needs that intent compiled into a trustworthy, governed answer, deterministically, every time.

In that world the bottleneck is no longer the model. It is context. The enterprises that win will not be the ones with the largest metric catalogs. They will be the ones whose systems can reason across meaning, learn from usage, and refuse to guess. From metrics as numbers, to metrics as knowledge, to systems that can think with them.

Ship AI you can trust enough to put in production.