That gap is not cosmetic. Every dashboard you own reports correlation, and almost every decision you make needs causation. "Churn went up" is a correlation. "The April price change caused churn to go up" is a causal claim, and it is the only kind of claim a budget meeting actually wants. The data to tell them apart is sitting in your event tables right now. The causal structure that would let you tell them apart is not stored anywhere. It lives in an analyst's head and in 70 lines of SQL nobody can reproduce.
This post is about closing that gap: what causal structure actually is (Pearl-grade, no hand-waving), why events are the natural place it already lives, and how a semantic layer can capture, version, and eventually enforce it.
The gap: dashboards show correlation, decisions need causation
Judea Pearl, who won the 2011 Turing Award for this work, organizes the gap into a Ladder of Causation with three rungs.
- Rung 1, Association (seeing). P(Y | X). "Customers who saw the win-back email reactivated more often." Every BI tool and most ML models live here.
- Rung 2, Intervention (doing). P(Y | do(X)). "If we send the win-back email, how many more customers reactivate?" The do operator means you reach in and set X, breaking whatever used to determine it.
- Rung 3, Counterfactual (imagining). "This churned customer got the email and left anyway. Would they have stayed if we had not sent it?"
The jump from Rung 1 to Rung 2 is where decisions are made and where money is lost. A dashboard cannot make that jump on its own, because P(Y | X) and P(Y | do(X)) are different quantities whenever a confounder is in play. Which, in an enterprise, is always.
Confounders, in one example you already believe
The textbook case: shoe size correlates with reading ability in children. Bigger feet, better reading. Nobody thinks buying bigger shoes makes a child read. Age causes both. Age is a confounder, and the path "reading ability ← age → shoe size" is a backdoor path that manufactures correlation with zero causation.
The enterprise version is everywhere. You email a win-back offer to churned customers and observe that recipients reactivate more. Did the email cause it? Maybe. But you targeted the email at a segment, and segment predicts both who got the email and who was going to reactivate anyway. Segment is the confounder. The naive "reactivation rate among recipients" is the shoe-size number.
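To put numbers on the win-back example, here is a minimal sketch. The segment names, targeting rates, and reactivation rates are all made-up assumptions, not real data. It computes the dashboard number, P(reactivate | email), next to the rate you would see if the email were sent to everyone regardless of segment.

```python
# Hypothetical inputs: two segments with different baseline reactivation.
P_SEGMENT = {"warm": 0.3, "cold": 0.7}   # share of churned customers
P_EMAIL = {"warm": 0.8, "cold": 0.1}     # targeting rule: P(email | segment)
P_REACT = {                              # P(reactivate | segment, email)
    ("warm", 1): 0.50, ("warm", 0): 0.45,
    ("cold", 1): 0.10, ("cold", 0): 0.05,
}


def naive_rate(email):
    """P(reactivate | email): the rate a dashboard reports."""
    numerator = 0.0
    denominator = 0.0
    for segment, p_segment in P_SEGMENT.items():
        p_email = P_EMAIL[segment] if email == 1 else 1 - P_EMAIL[segment]
        numerator += p_segment * p_email * P_REACT[(segment, email)]
        denominator += p_segment * p_email
    return numerator / denominator


def interventional_rate(email):
    """P(reactivate | do(email)): set the email for everyone, so the
    segment mix is the population mix, not the targeted mix."""
    return sum(p * P_REACT[(segment, email)] for segment, p in P_SEGMENT.items())


naive_lift = naive_rate(1) - naive_rate(0)
true_lift = interventional_rate(1) - interventional_rate(0)
print(f"naive lift: {naive_lift:.3f}, lift under intervention: {true_lift:.3f}")
```

With these assumed inputs the naive lift is roughly 32 points while the lift from actually sending the email is 5 points. The entire difference comes from the targeting rule.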
The most famous reversal is the 1973 UC Berkeley graduate admissions case. Aggregate data showed 44.3% of male applicants admitted (3,738 of 8,442) versus 34.6% of female applicants (1,494 of 4,321), a statistically significant gap that looked like bias. Broken down by department, women had equal or slightly higher admission rates in most departments. Women applied in greater numbers to more competitive departments with lower admission rates overall. Department was the confounder, and conditioning on it reversed the conclusion (Bickel, Hammel & O'Connell, Science, 1975). This is Simpson's paradox, and it is not a curiosity. It is what happens to any metric you aggregate without knowing the causal structure underneath.
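The reversal is easy to reproduce with toy numbers. The sketch below uses two hypothetical departments with invented counts (these are not the Berkeley figures) in which women are admitted at a higher rate in each department and at a lower rate overall.

```python
# Hypothetical counts: department -> group -> (admitted, applied).
APPLICATIONS = {
    "easy": {"men": (480, 800), "women": (130, 200)},
    "hard": {"men": (40, 200), "women": (200, 800)},
}


def department_rate(department, group):
    admitted, applied = APPLICATIONS[department][group]
    return admitted / applied


def aggregate_rate(group):
    """Admission rate after pooling all departments together."""
    admitted = sum(d[group][0] for d in APPLICATIONS.values())
    applied = sum(d[group][1] for d in APPLICATIONS.values())
    return admitted / applied


for department in APPLICATIONS:
    print(department,
          f"men {department_rate(department, 'men'):.2f}",
          f"women {department_rate(department, 'women'):.2f}")
print("all ",
      f"men {aggregate_rate('men'):.2f}",
      f"women {aggregate_rate('women'):.2f}")
```

Women lead by five points inside each department (0.65 vs 0.60 and 0.25 vs 0.20) and trail by nineteen in the pooled figure (0.33 vs 0.52), purely because most of them applied to the department that admits fewer people.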
The mechanics, stated precisely
You do not need to mystify this. The machinery is finite.
DAG (directed acyclic graph). Nodes are variables. A directed edge X → Y means X is a direct cause of Y. Acyclic means no loops (cause precedes effect). A DAG is a set of assumptions about the world, written down so they can be argued with.
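As a small illustration of "written down so it can be argued with", the sketch below encodes the shoe-size DAG as a plain mapping from each variable to its direct causes and uses Python's standard-library graphlib to confirm it is acyclic. The dictionary layout and the function name are assumptions for illustration only.

```python
from graphlib import CycleError, TopologicalSorter

# Each variable maps to the set of its direct causes (its parents).
SHOE_SIZE_DAG = {
    "age": set(),
    "shoe_size": {"age"},
    "reading_ability": {"age"},
}


def causal_order(parents):
    """Return the variables with causes before effects.
    Raises CycleError if the graph contains a loop."""
    return list(TopologicalSorter(parents).static_order())


order = causal_order(SHOE_SIZE_DAG)
print(order)  # "age" comes first; it has no causes in this graph

# A loop is not a valid causal model, and the check says so.
try:
    causal_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
```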
Backdoor path. Any path from treatment X to outcome Y that starts with an arrow into X. These carry non-causal association. X ← C → Y is the simplest.
Backdoor criterion (Pearl). A set of variables S identifies the causal effect of X on Y if (1) no variable in S is a descendant of X, and (2) S blocks every backdoor path from X to Y. Condition on the right S and the correlation that remains is causation. This is the whole game: find the adjustment set.
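Written out, the adjustment that the criterion licenses is the standard backdoor adjustment formula. For a discrete set S satisfying both conditions:

```latex
P\bigl(Y = y \mid do(X = x)\bigr) \;=\; \sum_{s} P\bigl(Y = y \mid X = x,\, S = s\bigr)\, P(S = s)
```

The right-hand side contains only observational quantities: the outcome rate within each stratum of S, weighted by how common that stratum is in the whole population rather than among the treated.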
Do-calculus (Pearl, 1995). Three rewrite rules that turn an interventional expression P(Y | do(X)) into something you can compute from observational data, given the DAG. Backdoor and frontdoor adjustment are special cases. If the effect is identifiable, do-calculus produces the formula; if it is not, do-calculus tells you that too.
Colliders and the trap of over-controlling. A collider is a node with two arrows into it: X → Z ← Y. Conditioning on a collider opens a spurious path between its parents (collider bias). Conditioning on a variable downstream of your treatment can either block part of the effect you are trying to measure or open a collider path of its own (post-treatment bias). Both are counterintuitive: adding a control can make the answer worse. The "obesity paradox" and the "birth-weight paradox" in epidemiology are commonly explained as collider artifacts. Any system that claims to reason causally has to know which variables are safe to condition on and which are poison.
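Collider bias is easier to believe once you see it happen. This is a minimal simulation under assumed distributions: X and Y are generated independently, Z is their noisy sum, and "conditioning on Z" is modeled as keeping only rows where Z is above a threshold. The sample size, noise level, and threshold are arbitrary choices.

```python
import random


def correlation(xs, ys):
    """Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]                 # independent of x
z = [a + b + rng.gauss(0, 0.5) for a, b in zip(x, y)]   # collider: x -> z <- y

overall = correlation(x, y)

# "Control for" the collider by selecting on it.
kept = [(a, b) for a, b, c in zip(x, y, z) if c > 1.0]
conditioned = correlation([a for a, _ in kept], [b for _, b in kept])

print(f"corr(x, y) overall:     {overall:+.3f}")
print(f"corr(x, y) given z > 1: {conditioned:+.3f}")
```

The overall correlation is near zero, as constructed. Among the selected rows it turns clearly negative, even though neither variable has any effect on the other.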
Why RCTs are the gold standard. Randomize the treatment and there is no arrow into it, so every backdoor path is broken by construction. That is the entire magic of an A/B test. Observational causal inference tries to earn the same guarantee by naming and measuring every confounder, which is cheaper, faster, and far easier to get wrong.
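A short simulation shows what randomization buys. It uses the win-back email setup with hypothetical segment shares and reactivation rates (all assumed), and compares the raw difference in reactivation when the email is targeted by segment with the same difference when a coin flip decides who gets it. The true effect built into the simulation is 5 points in both segments.

```python
import random

# Hypothetical world: segment drives both targeting and baseline reactivation.
P_WARM = 0.3
P_TARGET = {"warm": 0.8, "cold": 0.1}   # observational P(email | segment)
P_REACT = {                             # P(reactivate | segment, email)
    ("warm", 1): 0.50, ("warm", 0): 0.45,
    ("cold", 1): 0.10, ("cold", 0): 0.05,
}


def simulate(randomized, n=200_000, seed=7):
    """Return the raw difference in reactivation rate, email vs no email."""
    rng = random.Random(seed)
    counts = {1: [0, 0], 0: [0, 0]}     # email -> [reactivated, total]
    for _ in range(n):
        segment = "warm" if rng.random() < P_WARM else "cold"
        # Randomization replaces the targeting rule with a coin flip,
        # so segment no longer has any influence on who is treated.
        p_email = 0.5 if randomized else P_TARGET[segment]
        email = 1 if rng.random() < p_email else 0
        reacted = rng.random() < P_REACT[(segment, email)]
        counts[email][0] += reacted
        counts[email][1] += 1
    return counts[1][0] / counts[1][1] - counts[0][0] / counts[0][1]


observational_lift = simulate(randomized=False)
randomized_lift = simulate(randomized=True)
print(f"targeted send:   {observational_lift:.3f}")
print(f"randomized send: {randomized_lift:.3f}")
```

No adjustment is applied in either run. The randomized run lands close to the built-in 5 points on its own, while the targeted run overstates it several times over.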
Events are where causality already lives
Here is the part that matters for builders. You do not have to bolt causality onto your stack from outside. Your event-driven architecture is already a causal substrate. You just throw the structure away on write.
An event is "something happened": an immutable fact with a timestamp. A trigger is "in response, do this." A causal chain is event → state change → downstream consequence. Event sourcing already commits to this worldview: state is an append-only log of immutable events, and current state is a replay. A workflow or state machine is a literal causal model: state A + event E → state B. Approval flows, order-to-cash, clinical decision support are all implicit causal graphs.
So why can't you query them as causal graphs? Because when events land in the warehouse, every event sits next to every other event and the cause column is NULL. The data is there. The logic is not.
A caution builders will appreciate, because it kills a tempting shortcut. Timestamps are necessary for causality but not sufficient. X can only cause Y if X precedes Y, but "X's timestamp is earlier" does not prove causation. In a distributed system, wall clocks drift, and Lamport's happened-before relation (1978) only gives a partial order: if A causally precedes B then LC(A) < LC(B), but the converse fails for concurrent events. Kafka guarantees ordering only within a partition. So you cannot infer the causal graph from timestamps alone. You have to capture it.
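The failed converse can be shown in a few lines. The sketch below gives two processes a Lamport clock each and, as an addition that the text above does not discuss, a vector clock alongside it so that concurrency can actually be detected. The class and function names are illustrative.

```python
class Process:
    """One node keeping both a Lamport clock and a vector clock."""

    def __init__(self, name, all_names):
        self.name = name
        self.lamport = 0
        self.vector = {n: 0 for n in all_names}

    def _stamp(self):
        return {"lamport": self.lamport, "vector": dict(self.vector)}

    def local_event(self):
        self.lamport += 1
        self.vector[self.name] += 1
        return self._stamp()

    def receive(self, message):
        # Lamport rule: move past the sender's clock, then tick.
        self.lamport = max(self.lamport, message["lamport"]) + 1
        # Vector rule: element-wise max, then tick our own slot.
        for n, count in message["vector"].items():
            self.vector[n] = max(self.vector[n], count)
        self.vector[self.name] += 1
        return self._stamp()


def happened_before(a, b):
    """True only if a causally precedes b, judged by vector clocks."""
    va, vb = a["vector"], b["vector"]
    return va != vb and all(va[n] <= vb[n] for n in va)


names = ["A", "B"]
proc_a, proc_b = Process("A", names), Process("B", names)

a1 = proc_a.local_event()   # Lamport 1 on A
b1 = proc_b.local_event()   # Lamport 1 on B
b2 = proc_b.local_event()   # Lamport 2 on B, never heard from A
a2 = proc_a.local_event()   # Lamport 2 on A, sent to B as a message
b3 = proc_b.receive(a2)     # Lamport 3 on B, causally after a2

print(a1["lamport"] < b2["lamport"])                      # True
print(happened_before(a1, b2), happened_before(b2, a1))   # False False
print(happened_before(a2, b3))                            # True
```

Events a1 and b2 have ordered Lamport timestamps and no causal relationship in either direction. Only the explicit message from A to B creates one.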
Modeling events as causal graph entities
This is the Colrows position, and it is deliberately modest about what is solved today.
An event becomes a typed entity in the semantic graph, first-class like a Customer or an Order. It has a name (price_change, churn_risk_flag), an owner, a typed payload, and typed relationships to other events:
event churn_risk_flag {
  caused_by:       price_change(account)    # upstream trigger
  triggers:        retention_offer          # downstream consequence
  part_of:         q2_retention_campaign    # enclosing workflow
  contradicts:     reactivation_event       # supersedes / invalidates
  edge_provenance: declared | workflow | inferred
}
Once those edges exist as graph relationships, the planner can traverse them. "Show me churn-flagged accounts where the upstream event was a price change in the last 60 days" stops being a 70-line SQL query with a hand-guessed time window and becomes a typed graph traversal that compiles into one.
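To show the shape of that traversal, here is a minimal in-memory sketch. It is not the Colrows planner or its API: the Event class, its field names, and both functions are hypothetical, and a real planner would compile to SQL rather than loop in Python. The point is only that once caused_by is populated, the query follows edges instead of guessing a time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    event_type: str
    account: str
    occurred_at: datetime
    caused_by: Optional[str] = None   # id of the upstream event, if captured


def upstream_chain(event, index):
    """Yield each ancestor reached by following caused_by edges."""
    seen = set()
    current = event
    while current.caused_by is not None and current.caused_by not in seen:
        seen.add(current.caused_by)
        parent = index.get(current.caused_by)
        if parent is None:            # dangling reference: stop quietly
            return
        current = parent
        yield current


def flagged_after_price_change(events, now, window_days=60):
    """Accounts with a churn flag whose upstream cause is a recent price change."""
    index = {e.event_id: e for e in events}
    cutoff = now - timedelta(days=window_days)
    accounts = []
    for e in events:
        if e.event_type != "churn_risk_flag":
            continue
        for cause in upstream_chain(e, index):
            if cause.event_type == "price_change" and cause.occurred_at >= cutoff:
                accounts.append(e.account)
                break
    return accounts


now = datetime(2024, 6, 30)
events = [
    Event("p1", "price_change", "acct-1", now - timedelta(days=30)),
    Event("f1", "churn_risk_flag", "acct-1", now - timedelta(days=10), caused_by="p1"),
    Event("f2", "churn_risk_flag", "acct-2", now - timedelta(days=5)),
    Event("p3", "price_change", "acct-3", now - timedelta(days=90)),
    Event("f3", "churn_risk_flag", "acct-3", now - timedelta(days=20), caused_by="p3"),
]
matches = flagged_after_price_change(events, now)
print(matches)  # ['acct-1']
```

Account 2 is excluded because no cause was captured, and account 3 because its price change falls outside the window. Neither decision depends on timestamp proximity between unrelated events.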
Where does the cause column come from? Three sources, in descending fidelity.
- Producer-declared. The service that emits the retention action also emits caused_by: churn_risk_flag(account=...). Strongest signal. Worth investing in.
- Workflow-derived. When an event comes out of a known automation or campaign, the graph inherits the parent from the workflow definition.
- Inferred. When nothing declares causality, agents propose edges from statistical patterns, and a human approves them. Inferred edges stay typed as inferred. They are never silently promoted to declared.
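One way to keep those three sources honest in code is to make provenance a closed type and make approval a separate field rather than a change of provenance. The sketch below is an assumption about how that could look, not a description of a shipped schema. Treating unapproved inferred edges as unusable by the planner is also an assumption.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Provenance(Enum):
    DECLARED = "declared"   # emitted by the producing service
    WORKFLOW = "workflow"   # inherited from a workflow definition
    INFERRED = "inferred"   # proposed from statistical patterns


@dataclass(frozen=True)
class CausalEdge:
    cause: str
    effect: str
    provenance: Provenance
    approved_by: Optional[str] = None

    def is_usable(self):
        """Inferred edges need a human sign-off before they can be traversed."""
        if self.provenance is Provenance.INFERRED:
            return self.approved_by is not None
        return True


def approve(edge, reviewer):
    """Record the reviewer. Provenance is deliberately left untouched,
    so an approved inferred edge is still visibly an inferred edge."""
    if edge.provenance is not Provenance.INFERRED:
        raise ValueError("only inferred edges go through approval")
    return replace(edge, approved_by=reviewer)


proposed = CausalEdge("price_change", "churn_risk_flag", Provenance.INFERRED)
approved = approve(proposed, reviewer="analyst@example.com")
print(proposed.is_usable(), approved.is_usable(), approved.provenance.value)
```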
And here is the honesty that the literature demands and that most "causal AI" marketing skips: an inferred edge is exactly where confounding hides. A high correlation between price_change and churn does not earn a causal arrow, because demand volatility or a competitor launch may drive both. This is why inferred edges are hypotheses, not facts, and why the only way to settle them is the adjustment-set discipline above or an experiment. A semantic layer's job is not to fake Rung 2. It is to make the Rung 1 structure explicit, attach provenance, and refuse to pretend the rest.
What this changes, and what it is worth
For the analyst, ad-hoc causal questions stop being multi-hour archaeology. They write the chain (price_change -> demand_drop -> churn_risk -> retention_offer -> save) and the planner walks it, with the same adjustment set every time, reproducibly.
For the audit team, every causal claim carries a source (declared, workflow, or inferred-and-approved) and a graph version. "This number went up because of that campaign" stops being an opinion and becomes a query against typed edges, replayable against the causal structure that was in force on the day the decision was made.
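Replaying against the structure "in force on the day" amounts to an as-of lookup over graph versions. The sketch below assumes a simple list of versions with effective dates. The version labels, dates, and edges are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history: (effective_from, label, causal edges).
GRAPH_VERSIONS = [
    (date(2024, 1, 1), "v1",
     frozenset({("price_change", "churn_risk_flag")})),
    (date(2024, 4, 1), "v2",
     frozenset({("price_change", "demand_drop"),
                ("demand_drop", "churn_risk_flag")})),
]


def graph_as_of(day):
    """Return (label, edges) for the version in force on the given day."""
    starts = [effective_from for effective_from, _, _ in GRAPH_VERSIONS]
    position = bisect_right(starts, day) - 1
    if position < 0:
        raise LookupError(f"no causal graph was in force on {day}")
    _, label, edges = GRAPH_VERSIONS[position]
    return label, edges


march_label, march_edges = graph_as_of(date(2024, 3, 15))
april_label, april_edges = graph_as_of(date(2024, 4, 1))
print(march_label, april_label)  # v1 v2
```

A claim made in March is then checked against the March edges, even if the model has since been revised.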
Is the business case real? The most defensible number in the literature is from Brynjolfsson, Hitt, and Kim's study of 179 large publicly traded firms: those that adopt data-driven decision-making "have output and productivity that is 5-6% higher than what would be expected given their other investments and information technology usage," with effects also in asset utilization, return on equity, and market value. Using instrumental-variables methods, the authors report that these effects "do not appear to be due to reverse causality."
The cost of getting causality wrong is just as measurable. Across 15 large randomized experiments at Facebook (about 500 million user-experiment observations), Gordon and colleagues found that standard observational attribution methods "often fail to accurately measure the true effect of advertising," and "in half of our studies, the estimated percentage increase in purchase outcomes is off by a factor of three" (Gordon, Zettelmeyer, Bhargava & Chapsky, Marketing Science, 2019). eBay's randomized paid-search experiments implied a return on ad spend of about −63% once measured causally, because consumers simply substituted to organic results; Blake, Nosko & Tadelis (Econometrica, 2015) report that "returns from paid search are a fraction of non-experimental estimates" and that "brand keyword ads have no measurable short-term benefits." Every one of those errors is a confounder that a dashboard could not see.
Where this sits in your stack
Your event-streaming platform (Kafka, Pulsar, Redpanda) captures events and guarantees per-partition order. It does not encode causality. Your orchestrator (Airflow, Dagster, Prefect) has DAGs, but they are task and asset-dependency graphs and schedules, not cause-and-effect of business outcomes; Dagster's asset lineage is the closest cousin and still answers "how was this table built," not "what caused churn." Your causal inference libraries (DoWhy, EconML, CausalImpact, the grf causal forest) are excellent at post-hoc analysis, the model-identify-estimate-refute loop, but they are notebooks run after the fact, not structure enforced at query time. DoWhy even ships named root-cause notebooks for supply-chain distribution changes and microservice latency, which is exactly the shape of question enterprises ask, just not wired into the serving layer.
Semantic layers (dbt's MetricFlow, Cube, Colrows) are where typed relationships, first-class metrics (the natural candidates for causal nodes), and versioning already live. None of them ship causal annotations today. That is the open space. Colrows is positioned for it because the graph already has typed relationships and versioned metrics; adding causal direction and provenance to an edge is an extension of a model that exists, not a new product.
To be precise about scope: capturing causal annotations, versioning the graph, and tracking provenance is foundational work that is achievable now. Full do-calculus enforcement at compile time, refusing a query that conditions on a post-treatment variable the way a type checker refuses a type error, is the direction, not a shipped feature. The honest claim is the smaller one: warehouses are state systems, semantics are relationship systems, and the hidden logic of the enterprise has been sitting between the rows the whole time. Modeling events as graph entities, with typed and traversable causal edges, is the first step that makes that logic queryable instead of tribal.
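For a sense of what the smallest version of that check could look like, here is a sketch that rejects an adjustment set containing any descendant of the treatment. It covers only the first condition of the backdoor criterion, it does not verify that the remaining controls block every backdoor path, and it is an illustration of the direction rather than anything shipped. The graph, the segment node, and all names are hypothetical.

```python
class PostTreatmentError(ValueError):
    """Raised when a query conditions on something the treatment affects."""


def descendants(children, node):
    """All nodes reachable from `node` by following cause -> effect edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def check_adjustment_set(children, treatment, controls):
    """Refuse controls that sit downstream of the treatment."""
    offending = sorted(set(controls) & descendants(children, treatment))
    if offending:
        raise PostTreatmentError(
            f"cannot condition on {offending}: downstream of {treatment!r}"
        )


# Hypothetical graph: the retention chain plus a segment confounder.
CHILDREN = {
    "segment": ["price_change", "churn_risk"],
    "price_change": ["demand_drop"],
    "demand_drop": ["churn_risk"],
    "churn_risk": ["retention_offer"],
    "retention_offer": ["save"],
}

check_adjustment_set(CHILDREN, "price_change", {"segment"})   # passes silently

try:
    check_adjustment_set(CHILDREN, "price_change", {"segment", "demand_drop"})
    rejected = False
except PostTreatmentError as error:
    rejected = True
    print(error)
```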
FAQ
What is the difference between correlation and causation in analytics?
Correlation, P(Y|X), says X and Y move together. Causation, P(Y|do(X)), says intervening on X changes Y. Dashboards report correlation; decisions like "raise spend 10%" need causation. They diverge whenever a confounder influences both X and Y.
What is a DAG in causal inference?
A directed acyclic graph where nodes are variables and a directed edge X → Y means X directly causes Y. It encodes your causal assumptions explicitly so they can be tested and argued with.
What is a confounder?
A variable that causes both the treatment and the outcome, creating a backdoor path that produces non-causal correlation. Age confounds shoe size and reading ability; department confounded gender and admission at Berkeley.
What is the backdoor criterion?
Pearl's graphical rule for choosing an adjustment set S: no member of S is a descendant of the treatment, and S blocks every backdoor path. Condition on a valid S and the remaining association is the causal effect.
What is do-calculus?
Three rules from Pearl (1995) that convert an interventional query P(Y|do(X)) into an estimable observational expression given a DAG. It decides identifiability and produces the estimand when one exists.
Why is conditioning on more variables sometimes wrong?
Because of colliders and post-treatment bias. Conditioning on a node with two arrows into it, or on anything downstream of the treatment, opens a spurious path and biases the estimate. More controls can make the answer worse.
Do event-streaming and workflow tools capture causality?
No. Kafka and similar guarantee per-partition ordering; Airflow, Dagster, and Prefect model task and asset dependencies and schedules. None encode the cause-and-effect of business outcomes.
Can timestamps prove causality?
No. Temporal precedence is necessary but not sufficient. In distributed systems, ordered timestamps do not imply causation (Lamport's happened-before only gives a partial order), so the causal graph has to be captured, not inferred from time alone.
What can a semantic layer do for causality today, and what is still future work?
Today: capture typed causal edges with provenance, model metrics as causal nodes, and version the graph for replayable, auditable decisions. Future work: enforcing do-calculus constraints at compile time, for example refusing queries that condition on post-treatment variables.
