Someone notices a metric stopped behaving the way it used to. Someone remembers a column changed meaning six months ago. Someone rewrites a query because a table was altered upstream. Someone explains, again, why two numbers do not match. None of this is planned. None of it ships. But without it, trust collapses fast.
Here is the part nobody wants to say out loud: that work is now most of the job.
Maintenance is a tax, and it scales faster than you can hire
Fivetran's Enterprise Data Infrastructure Benchmark Report 2026, a survey of 500 senior data leaders at large enterprises, found that 53% of engineering time goes to maintaining existing pipelines rather than building anything new. At organizations running more than 200 active pipelines, that figure climbs to 61%. dbt Labs' State of Analytics Engineering 2026 arrived at the same 53% from the other direction, surveying 4,200 practitioners rather than data leaders. The Fivetran report also put average annual data program spend at $29.3 million, with about 4.7 pipeline failures per month, roughly 13 hours to resolve each, and an average of $3 million per month in business exposure to downtime.
Stack the rest of the industry's numbers on top. Gartner's 2020 data quality research estimated that poor data quality costs the average organization $12.9 million a year. McKinsey's 2019 Global Data Transformation Survey found employees lose an average of 30% of their total enterprise time to non-value-added tasks caused by poor data quality and availability. McKinsey's 2020 cost research adds that data users can spend 30-40% of their time searching for data and another 20-30% cleaning it when controls are weak.
This is a tax, and it has a nasty property: it scales with data volume, and data volume scales faster than headcount. You cannot hire your way out. Data engineering roles take 60-90 days to fill in enterprise settings, and demand for the AI-adjacent skill mix outpaces supply. Every new source, every new consumer, every new model on top of the warehouse adds maintenance surface. The team grows linearly at best. The surface grows combinatorially.
Google's site reliability engineering practice named this problem years ago. They call it toil: work that is manual, repetitive, automatable, and devoid of enduring value, and that scales linearly as a service grows. The Google SRE Book caps toil at 50% of an SRE's time precisely because it is corrosive past that point, and Google's own quarterly surveys put average measured toil around 33%. There is also a structural floor: in a six-person on-call rotation, two of every six weeks go to interrupt handling, so 33% toil is the lower bound before anyone touches a real problem. An eight-person rotation only gets you to 25%.
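The structural floor is plain arithmetic, and it is worth running against your own rotation. A minimal sketch, assuming each engineer carries two on-call weeks per cycle and that the cycle is as many weeks long as the rotation has people; the function name and parameters are illustrative, not taken from the SRE Book:

```python
from fractions import Fraction


def oncall_toil_floor(rotation_size: int, oncall_weeks_per_cycle: int = 2) -> Fraction:
    """Lower bound on the share of time that goes to interrupt handling.

    Assumption: every engineer serves `oncall_weeks_per_cycle` weeks of
    on-call in a cycle that lasts `rotation_size` weeks.
    """
    if rotation_size < oncall_weeks_per_cycle:
        raise ValueError("rotation is too small for the assumed on-call load")
    return Fraction(oncall_weeks_per_cycle, rotation_size)


for size in (6, 8):
    floor = oncall_toil_floor(size)
    print(f"{size}-person rotation: {float(floor):.0%} minimum toil")
```

Under those assumptions the six-person rotation lands at 33% and the eight-person rotation at 25%, matching the figures above. Adding people lowers the floor slowly, which is the point.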
Most data stacks are held together by memory
Look closely at a modern data stack and you find it is held together by human memory. Why this metric exists. Why that join is written a certain way. Why a definition changed. Why this dashboard can be trusted and that one cannot.
That knowledge does not live in one place. It lives in Slack threads, old pull requests, meetings, and people's heads. When someone leaves, the system loses context. US private-sector voluntary turnover runs 22-25% a year, replacing a mid-level employee costs 6-9 months of salary, and 47% of organizations name key-person dependency as a significant operational risk, up from 34% in 2020. The meaning walks out the door with the person.
The sibling problem is what we have written about elsewhere as knowledge drift and semantic decay. A definition shifts slightly. A metric gets used in a new context. A rule is updated in one place and not another. Nothing crashes. No alert fires. But the system slowly stops telling the same story it used to. Revenue still looks like revenue. Churn still looks like churn. The numbers are quietly diverging underneath.
Why this is finally a systems problem, not a people problem
What changed recently is not attitude. It is capability. We now have systems that can observe schemas, usage, queries, and definitions continuously, not once a quarter and not only when something breaks.
That opens a different path. Instead of waiting for a human to notice drift, a system can detect it. Instead of relying on memory, it can compare past and present meaning. Instead of reacting after damage is done, it can surface issues early.
This is where agents come in. And the word matters, because it is being abused.
Agents versus automation: the distinction that actually matters
Traditional automation follows fixed "if-then" logic. A cron job runs. A test passes or fails. A rule fires. It is predictable and it is brittle: it only handles the cases you anticipated.
An agent evaluates context, weighs multiple factors, and decides. A multi-agent system is a network of specialized agents, each with a defined role, coordinated by an orchestration layer that assigns work, resolves conflicts, and contains failure. The shift is from executing rules to reasoning over goals.
For data maintenance, the useful agent roles divide cleanly:
| Agent type | Job | Example action |
|---|---|---|
| Discovery | Map what exists | Ingest schemas, metadata, docs; identify candidate entities and metrics |
| Architecture / validation | Enforce correctness | Validate grain and dependencies; refuse definitions that violate business logic |
| Learning | Improve from usage | Observe how humans and agents use definitions; refine synonyms and examples |
| Monitoring / remediation | Catch drift | Detect distribution shifts, schema changes, broken assumptions; trigger playbooks |
| Compliance | Prove control | Maintain lineage, attribution, and audit trails for regulators |
The orchestration layer is not optional decoration. In a non-orchestrated system, one agent produces a flawed output, a second consumes it, a third builds on that, and the error surfaces only after business impact. Orchestration defines where errors get intercepted and when a human is pulled in.
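One way to picture the interception point is a check that runs between every pair of agents. The sketch below is hypothetical throughout: `run_stages`, `Escalation`, and the toy agents are invented for illustration and do not describe any particular orchestration product.

```python
class Escalation(Exception):
    """A stage's output failed its check; a human should look before work continues."""


def run_stages(stages, payload):
    """Run (name, agent, check) stages in order, halting at the first failed check."""
    for name, agent, check in stages:
        payload = agent(payload)
        if not check(payload):
            # Stop here so no downstream agent builds on the flawed output.
            raise Escalation(f"{name} produced output that failed validation")
    return payload


# Hypothetical agents: discovery forgets to set the grain of a candidate metric.
def discover(_):
    return {"metric": "active_customers", "grain": None}


def has_grain(candidate):
    return candidate.get("grain") is not None


def publish(candidate):
    return {**candidate, "published": True}


stages = [
    ("discovery", discover, has_grain),
    ("publish", publish, lambda c: c.get("published") is True),
]

try:
    run_stages(stages, None)
except Escalation as exc:
    print(f"escalated to a human: {exc}")
```

The publish agent never runs in this example. That is the whole job of the gate: the error is caught at the stage that produced it, not three stages later.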
Why full automation is the wrong goal
It is tempting to want the system to just fix everything. Resist that.
Multi-agent systems have a failure mode that single agents do not: cascading and correlated failure. An incorrect output from one agent, consumed by a second, compounds in the third. Worse, agents that share the same underlying model, prompts, or configuration tend to fail in the same way at the same time, so your "independent" checks are not independent. One analysis of a 55-scenario multi-agent simulation found, counterintuitively, that populations with only 10% honest agents achieved 74% higher collective welfare than fully honest populations, because total trust let a single bad signal cascade unchecked. Skepticism, designed into the system, is a feature.
There is also a human failure mode. The literature on automation bias is unkind to the assumption that a reviewer watching a stream of agent proposals reviews each one carefully. The Lyell and Coiera 2017 meta-analysis found that an erroneous automated recommendation raised the likelihood of an incorrect human judgment by 26%. Under load, the prior shifts from scrutiny to practiced approval. An approval surface is only as meaningful as the person reading it.
So the mature design is not "automate everything." It is risk-tiered autonomy. Classify actions by reversibility:
- Read-only (observe, profile, flag): let the agent run.
- Reversible (re-run a partition, refresh a cache): auto-execute, then sample 5-20% of those actions for human review.
- External or hard-to-undo (change a definition, alter a downstream contract): require human approval.
- High blast radius (delete, mask, anything regulators care about): mandatory human sign-off with full context.
The rule of thumb from practitioners: trigger an interrupt where reversibility ends. Approve cheap-to-undo actions automatically. Gate the ones whose side effects survive the response. This is also where regulation and guidance now point. Article 14 of the EU AI Act requires effective human oversight of high-risk AI systems, and the voluntary NIST AI Risk Management Framework treats defined, documented human oversight as a core governance practice. Neither is satisfied by a checkbox.
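The four tiers translate into a small routing function. This is a sketch under stated assumptions: the `Tier` and `Decision` names, the 10% default sample rate (chosen from inside the 5-20% band), and the injectable `rng` are all illustrative choices, not a reference implementation.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    READ_ONLY = 1          # observe, profile, flag
    REVERSIBLE = 2         # re-run a partition, refresh a cache
    HARD_TO_UNDO = 3       # change a definition, alter a downstream contract
    HIGH_BLAST_RADIUS = 4  # delete, mask, anything regulators care about


class Decision(Enum):
    AUTO = "auto-execute"
    AUTO_SAMPLED = "auto-execute and queue for sample review"
    NEEDS_APPROVAL = "wait for human approval"
    NEEDS_SIGN_OFF = "wait for human sign-off with full context"


@dataclass(frozen=True)
class Action:
    name: str
    tier: Tier


def route(action: Action, sample_rate: float = 0.10, rng=random.random) -> Decision:
    """Decide how much human involvement an agent action needs."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if action.tier is Tier.READ_ONLY:
        return Decision.AUTO
    if action.tier is Tier.REVERSIBLE:
        # Reversible work runs immediately; a random sample is reviewed afterwards.
        return Decision.AUTO_SAMPLED if rng() < sample_rate else Decision.AUTO
    if action.tier is Tier.HARD_TO_UNDO:
        return Decision.NEEDS_APPROVAL
    return Decision.NEEDS_SIGN_OFF


print(route(Action("profile orders table", Tier.READ_ONLY)).value)
print(route(Action("redefine active_customer", Tier.HARD_TO_UNDO)).value)
```

The hard part is not this function. It is agreeing, per action type, which tier it belongs to, and keeping that classification under version control so nobody quietly downgrades a destructive action.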
What continuous monitoring actually looks like
Modern data observability converges on five signals: freshness, schema, distribution, volume, and lineage. Agents learn the normal pattern for each and flag deviations before a consumer notices.
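"Learn the normal pattern and flag deviations" can be as simple as a z-score against recent history. The sketch below applies that to the volume signal; the three-sigma threshold, the function name, and the row counts are assumptions for illustration, and production systems typically add seasonality handling on top.

```python
from statistics import mean, stdev


def deviates(history, observed, z_threshold=3.0):
    """Return True when `observed` is far outside the pattern seen in `history`.

    Assumption: recent history is roughly stable, so a plain z-score is a
    reasonable first baseline.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points to form a baseline")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_000]

print(deviates(daily_row_counts, 10_090))  # an ordinary day
print(deviates(daily_row_counts, 4_300))   # more than half the rows are missing
```

The same shape works for freshness (minutes since last load) and for distribution summaries such as null rate or mean. Schema and lineage are structural signals and need diffing rather than statistics.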
Self-healing follows a three-part loop: dense observability feeds intelligent detection, which feeds controlled remediation. Published research on self-healing pipelines reports detecting 94-96% of anomalies before they hit downstream systems, with remediation playbooks handling common failure modes such as replaying a bad partition, isolating corrupted batches, or rolling forward a schema migration. AI SRE deployments report reductions in mean time to resolution (MTTR) in the 30-50% range when implemented well, though vendor marketing often claims 50-80%; plan with the lower number.
The speed difference is the whole point. Manual response runs at human speed: someone has to notice, triage, diagnose, and fix, often hours after the fact and often at 2 AM. An agent runs the detection-to-diagnosis step at machine speed and continuously. The gap between "what changed" and "when someone noticed" is exactly where models learn the wrong signal and trust takes a hit. Closing that gap is the value.
The sane rollout is staged, and every credible source says the same thing:
- Instrument and observe (30-60 days). Schema monitoring, quality profiling, volume tracking, lineage. Goal is visibility, not automation. Most teams discover silent failures they never knew about.
- Automate detection and diagnosis. Statistical baselines plus lineage correlation. Remediation stays manual. Detection-to-diagnosis compresses from hours to minutes.
- Expand remediation authority. Start with low-risk, well-understood failures. Keep escalation policies, human checkpoints, audit logging, and rollback.
How agents compare to the tools you already run
Agents do not replace your orchestrator or your transformation layer. They sit at a different altitude.
| Tool | What it automates | What it does not touch |
|---|---|---|
| Apache Airflow | Scheduling, orchestration, dependencies, retries | Meaning of the data; metric definitions |
| dbt | Transformations, tests, data contracts (catches schema drift at build time) | Whether a definition still matches how the business uses it |
| Databricks / Snowflake | AI-assisted authoring (Agent Bricks, Cortex Code), quality monitoring | Cross-tool semantic consistency |
| Observability (Monte Carlo, Acceldata, Bigeye) | Anomaly detection, lineage, alerting | Resolving conflicting definitions of the same metric |
Notice the pattern. These tools operate on pipelines and tables. They are very good at it. dbt's contract enforcement will block a schema-breaking change at build time, which is genuinely valuable. But none of them own the layer of meaning. They can tell you a table changed. They cannot tell you that "active customer" now means something different to finance than it does to growth, and that three dashboards have quietly diverged as a result. That is the semantic layer, and in most stacks it is still a fragile, human-maintained artifact: a glossary, a Confluence page, a metric defined four times in four tools that no longer agree.
Where Colrows fits, and where it does not
Colrows is a semantic execution layer. It sits between users, AI agents, and your data systems, resolves business meaning at compile time, and runs governed SQL. Maintenance is built into that layer rather than bolted on.
To be precise about scope, because vague product claims are part of the problem this article is about: Colrows runs four named maintenance agents over its semantic layer, which is called Consensus. It is not a "five-agent" system, and Consensus is the layer, not an agent. The four are:
- Discovery agents ingest schemas, metadata, documentation, and wiki pages such as Confluence to identify candidate entities, events, metrics, and relationships.
- Architecture agents validate grain, dependencies, and constraints, and refuse to publish definitions that violate business logic.
- Learning agents observe how humans and AI systems actually use the graph and refine definitions, examples, and synonyms accordingly.
- Monitoring agents detect semantic drift using statistical fingerprinting of column distributions, structural diffing of dataset nodes, and hybrid vector-plus-structural equivalence analysis.
Conflict resolution is structural, not a vote. Two metrics are treated as equivalent only if their normalized expression trees and dependency sets match under canonical ordering. Vector similarity is used only to find candidates; structure makes the final call. This is what keeps the graph from filling up with near-duplicate definitions.
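To make "structure makes the final call" concrete, here is a toy version of the comparison. It is a sketch, not Colrows' implementation: expressions are modeled as nested tuples, the `COMMUTATIVE` set and `Metric` class are assumptions, and the normalization only sorts operands of commutative operators (it does not flatten nested sums, for instance).

```python
from dataclasses import dataclass

# Assumption: operators whose operand order does not change the result.
COMMUTATIVE = {"add", "mul", "and", "or"}


def canonical(expr):
    """Normalize an expression tree so equivalent orderings compare equal."""
    if isinstance(expr, str):
        return expr  # a leaf, such as a column reference
    op, *args = expr
    normalized = [canonical(arg) for arg in args]
    if op in COMMUTATIVE:
        normalized.sort(key=repr)  # canonical ordering of operands
    return (op, *normalized)


@dataclass(frozen=True)
class Metric:
    name: str
    expr: object
    depends_on: frozenset


def structurally_equivalent(a: Metric, b: Metric) -> bool:
    """Both the normalized tree and the dependency set must match."""
    return canonical(a.expr) == canonical(b.expr) and a.depends_on == b.depends_on


net_a = Metric("net_revenue", ("sub", ("add", "gross", "fees"), "refunds"), frozenset({"orders"}))
net_b = Metric("revenue_net", ("sub", ("add", "fees", "gross"), "refunds"), frozenset({"orders"}))
net_c = Metric("net_rev", ("sub", "refunds", ("add", "gross", "fees")), frozenset({"orders"}))

print(structurally_equivalent(net_a, net_b))  # operands reordered under a commutative op
print(structurally_equivalent(net_a, net_c))  # subtraction reversed
```

The names differ in all three metrics and would likely look similar to an embedding model. Only the first pair is the same metric, and only the structural check can say so.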
Three properties make this safe to put near production:
Governance is enforced at compile time. RBAC and ABAC policies are attached to the graph as policy nodes. A persona resolves to an allowed subgraph, and compilation happens only within it. If a metric depends on a node outside the permitted scope, resolution fails. There is no post-hoc masking to forget, because an unauthorized plan is never generated.
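A minimal sketch of the idea, assuming the graph is a plain dependency dictionary and a persona resolves to a set of permitted nodes. The `resolve` function and the node names are hypothetical; the point is only that the check happens while the plan is being built, so an out-of-scope plan never exists.

```python
def resolve(node, graph, allowed):
    """Return the nodes needed for `node` in dependency order.

    Raises PermissionError as soon as resolution touches a node outside the
    persona's permitted subgraph, so no partial plan is ever returned.
    """
    plan = []
    seen = set()

    def visit(current):
        if current in seen:
            return
        if current not in allowed:
            raise PermissionError(f"{current!r} is outside the permitted subgraph")
        seen.add(current)
        for dependency in graph.get(current, ()):
            visit(dependency)
        plan.append(current)

    visit(node)
    return plan


# Hypothetical semantic graph: metric -> the nodes it depends on.
graph = {
    "net_revenue": ["orders.amount", "refunds.amount"],
    "payroll_cost": ["hr.salary"],
}
analyst_scope = {"net_revenue", "orders.amount", "refunds.amount", "payroll_cost"}

print(resolve("net_revenue", graph, analyst_scope))

try:
    resolve("payroll_cost", graph, analyst_scope)
except PermissionError as exc:
    print(f"compilation failed: {exc}")
```

Note that the analyst can see the `payroll_cost` metric node in this example but not the column underneath it. Resolution still fails, which is the behavior described above.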
Join paths are proven, not guessed. When a metric spans datasets, Colrows solves the join as a constrained graph traversal, pruning paths that violate grain or expand cardinality beyond thresholds. If more than one valid path exists and nothing disambiguates them, compilation fails rather than silently returning a wrong number. The single largest source of bad numbers in enterprise BI is the silent join. Failing closed makes that class of error unreachable.
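Failing closed is easy to show in miniature. The sketch below is an illustration under simplifying assumptions: datasets form a small directed graph, every edge carries a cardinality label, and any `one_to_many` hop is pruned outright rather than compared against a threshold. `find_join_path` and `JoinError` are hypothetical names.

```python
class JoinError(Exception):
    """No single provably safe join path exists."""


def find_join_path(edges, source, target):
    """Return the only valid join path from `source` to `target`, or raise.

    `edges` maps a dataset to a list of (neighbor, cardinality) pairs.
    """
    valid_paths = []

    def walk(node, path):
        if node == target:
            valid_paths.append(path)
            return
        for neighbor, cardinality in edges.get(node, ()):
            if neighbor in path:
                continue  # never revisit a dataset
            if cardinality == "one_to_many":
                continue  # prune hops that would fan out rows
            walk(neighbor, path + [neighbor])

    walk(source, [source])
    if not valid_paths:
        raise JoinError(f"no valid join path from {source} to {target}")
    if len(valid_paths) > 1:
        raise JoinError(f"{len(valid_paths)} valid join paths from {source} to {target}; refusing to guess")
    return valid_paths[0]


# Hypothetical schema graph.
edges = {
    "order_items": [("orders", "many_to_one")],
    "orders": [("customers", "many_to_one"), ("shipments", "one_to_many")],
    "shipments": [("customers", "many_to_one")],
}
print(find_join_path(edges, "order_items", "customers"))
```

Add a second many-to-one route between the same two datasets and the function raises instead of picking one. An error at compile time is annoying. A confidently wrong number in a board deck is worse.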
Everything is versioned and attributed. Every change to a node is versioned, attributed to a human or an agent, timestamped, and linked to the agents, dashboards, and signals it affects. Changes never overwrite prior definitions, so point-in-time reproducibility is a free property: any historical query can be re-executed against the exact semantic state that was active when it ran.
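Point-in-time reproducibility falls out of an append-only history almost for free. The sketch below is a generic illustration of that pattern, not Colrows' storage model; `NodeHistory`, `Version`, and the example definitions are assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Version:
    at: datetime
    author: str       # a human or an agent
    definition: str


class NodeHistory:
    """Append-only history for one semantic node; nothing is ever overwritten."""

    def __init__(self):
        self._versions = []

    def publish(self, at: datetime, author: str, definition: str) -> None:
        if self._versions and at <= self._versions[-1].at:
            raise ValueError("versions must be appended in time order")
        self._versions.append(Version(at, author, definition))

    def as_of(self, when: datetime) -> Version:
        """Return the version that was active at `when`."""
        index = bisect_right([v.at for v in self._versions], when)
        if index == 0:
            raise LookupError("no definition was active at that time")
        return self._versions[index - 1]


history = NodeHistory()
history.publish(datetime(2025, 1, 1), "alice", "logged in within 30 days")
history.publish(datetime(2025, 6, 1), "monitoring-agent", "purchased within 30 days")

print(history.as_of(datetime(2025, 3, 15)).definition)
print(history.as_of(datetime(2025, 7, 1)).author)
```

Re-running a March query means compiling it against `as_of(March)`, not against today's definition. The attribution field is what lets you answer "who changed this, and was it a person?" without an archaeology project.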
An honest limitation: Colrows' documentation describes Architecture agents enforcing an automated publish gate (definitions that violate business logic do not get published) and a versioned, attributed change history. It does not document a formal "proposed versus published" staging queue with a named human approval step for every agent edit. If your governance posture requires a human to sign off on each semantic change before it goes live, treat that as a design requirement to confirm and configure, not an assumption. The risk-tiering framework earlier in this article is the right lens: decide which classes of change you are willing to let agents apply autonomously, and which must wait for a person.
The compliance angle nobody can skip
If you operate under SOX, HIPAA, or GDPR, autonomous maintenance is not just an efficiency story. It is a control story, and it cuts both ways.
The risk: an agent that changes a definition or moves data without a durable record is a compliance liability. Without session-level lineage, you cannot reconstruct what happened.
The opportunity: the same versioned, attributed, timestamped change history that makes agents debuggable is exactly what auditors ask for. Data lineage is the continuous, explainable record that GDPR, HIPAA, SOX, and BCBS 239 require. HIPAA mandates audit controls under 45 CFR 164.312(b) and six-year log retention. SOX needs traceability from source system to final disclosure. GDPR's data minimization and right-to-explanation provisions require knowing what was accessed, when, and why. An architecture where audit is a side effect of normal execution, rather than a separate logging project, turns a regulatory burden into a queryable asset. The separation of who proposes a change from who approves it, which agent gates give you for free, also satisfies separation-of-duties expectations.
What changes when the system maintains its own meaning
When maintenance becomes continuous, the feel of the system shifts. People stop second-guessing numbers as often. The "can you explain this?" messages drop. Dashboards feel calmer. AI outputs feel less risky because they are grounded in definitions that are actually current.
This is not intelligence in the flashy sense. It is reliability. And it changes what the team is for. Engineers move from firefighting to judgment. The mechanical work of detection, correlation, and routine repair, which consumes the majority of incident response time, moves to the agents. The humans keep the part that requires a human: deciding what matters, asking new questions, and setting the policy for what agents may do alone.
For a long time we built data systems that worked only because a few people constantly patched the gaps. That was never sustainable, and the numbers now make it indefensible: a majority of your most expensive engineers spending the majority of their time keeping the lights on. Systems that carry their own meaning forward, detect their own drift, and prove their own changes will simply age better than the ones that depend on memory. Once you work with one, it is hard to unsee how fragile everything else feels.
The next generation of data platforms will not be defined by how much they store. They will be defined by how well they preserve understanding over time. Because data does not rot on its own. Meaning does.
