Last month, the observability community gathered in Minneapolis on May 21-22 for Observability Summit North America 2026. The conversations were some of the most honest and forward-looking I’ve seen in years.
Two powerful themes dominated the week: the growing problem of observability debt and the rapid emergence of agentic AI as both a challenge and a solution for modern operations.
Here are the most important takeaways for technology leaders.
1. Observability Debt Is Real — and It’s Quietly Eroding Trust
Spoorthi Palakshaiah delivered one of the clearest definitions I’ve heard: Observability debt is not missing telemetry. It is telemetry whose meaning has drifted because the underlying system changed, but the dashboards, alerts, SLOs, and metrics were never updated.
The result is dangerous false confidence:
• Dashboards show green while users are suffering.
• p99 latency improves dramatically (because work moved to async workers), but actual user wait time increases.
• Error rates look healthy because silent background job failures are invisible to the original HTTP metrics.
How it happens: Teams instrument → ship → gain confidence → then the architecture evolves to include async pipelines, new caching layers, microservice splits, or team ownership changes. The observability layer stays static. Over time, the telemetry starts lying.
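To make the p99 example concrete, here is a minimal Python sketch with made-up numbers (not data from the talk). It assumes a service that used to do all its work inside the HTTP handler and later moved most of it to an async worker. The record layout and the function names `http_p99` and `user_wait_p99` are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request records: (http_handler_ms, async_worker_ms).
# Before: all work happens inline in the HTTP handler.
before = [(800 + i, 0) for i in range(100)]
# After: the handler only enqueues a job; a worker finishes it 1200 ms later.
after = [(50 + i, 1200) for i in range(100)]

def http_p99(records):
    # What the original dashboard measures: HTTP handler latency only.
    return percentile([http for http, _ in records], 99)

def user_wait_p99(records):
    # What the user experiences: handler time plus time until the job is done.
    return percentile([http + worker for http, worker in records], 99)

print(http_p99(before), user_wait_p99(before))  # 898 898
print(http_p99(after), user_wait_p99(after))    # 148 1348
```

In this toy data the dashboard metric drops from 898 ms to 148 ms while the time a user actually waits rises from 898 ms to 1348 ms. The metric is still computed correctly; it just no longer measures what the team thinks it measures.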
Executive implication: Your on-call teams may be fighting fires with increasingly outdated maps. This is a silent risk that grows with every architectural change.
2. The Industry Is Moving from Data Dumps to Smart, Goal-Centric Context
Thomas Johnson (CTO of Multiplayer) highlighted a critical problem with the current wave of MCP (Model Context Protocol) servers: most are still data-centric rather than goal-centric.
Current tools expose dozens of low-level operations (`get_dashboards`, `query_prometheus`, `list_alerts`). Agents need higher-level capabilities (`investigate_error`, `trace_user_journey`, `explain_deployment_impact`).
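One way to picture the difference is a goal-centric capability that composes the low-level calls on the agent's behalf. The sketch below is illustrative only: the low-level functions are stubs returning canned data, the `Investigation` type and the 5% threshold are assumptions, and none of this is the API of any real MCP server or SDK.

```python
from dataclasses import dataclass

# Stubbed data-centric operations. A real server would call a backend here.
def list_alerts(service):
    alerts = [
        {"service": "checkout", "name": "HighErrorRate", "firing": True},
        {"service": "search", "name": "HighLatency", "firing": False},
    ]
    return [a for a in alerts if a["service"] == service]

def query_prometheus(query):
    canned = {'error_rate{service="checkout"}': 0.07}
    return canned.get(query, 0.0)

@dataclass
class Investigation:
    service: str
    error_rate: float
    firing_alerts: list
    summary: str

def investigate_error(service, threshold=0.05):
    """Goal-centric capability: one call that gathers and correlates the evidence."""
    error_rate = query_prometheus(f'error_rate{{service="{service}"}}')
    firing = [a["name"] for a in list_alerts(service) if a["firing"]]
    if error_rate > threshold or firing:
        summary = f"{service}: error rate {error_rate:.0%}, firing alerts: {firing}"
    else:
        summary = f"{service}: no evidence of an active error condition"
    return Investigation(service, error_rate, firing, summary)

print(investigate_error("checkout").summary)
print(investigate_error("search").summary)
```

The agent asks one question about a goal and gets back a correlated answer, instead of having to discover which low-level queries to run and in what order.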
He also outlined where observability is heading:
• Pull becomes push (MCP triggers proactively surface relevant updates)
• Open loops become closed loops, with observability becoming an active participant
• Data becomes pre-correlated by default (by session, user, or deploy)
• Collection becomes dynamic rather than a permanent firehose
• Agents collaborate autonomously (keep in mind agents don’t collaborate well yet), with humans brought in only when needed
• Systems begin to heal themselves
The uncomfortable reality he shared is that AI agents are writing more code faster, but the numbers he cited show 1.7x more bugs in AI-generated code, 23.5% more incidents per PR, and a 30% higher change failure rate. MCP adoption is exploding, but it hasn’t yet solved the stability problem.
3. AI-Powered Root Cause Analysis Is Moving from Theory to Production
One of the most impressive real-world examples came from Nubank, which serves 120+ million customers. They built their own AI SRE agent because no vendor fully met their requirements around cost predictability, SOC/compliance needs, data residency, and deep integration with their custom observability stack.
Key results:
• Launched in production in just 2 months by 2 engineers
• Runs at under $0.20 per investigation
• Delivers structured root cause analysis via a chat interface in seconds to minutes
This is a powerful signal. At significant scale and complexity, leading organizations are choosing to build targeted AI agents rather than waiting for perfect vendor solutions.
4. We’re Missing a Critical Layer: Decision-Level Telemetry for Agents
Several sessions highlighted that current observability (even with OpenTelemetry + MCP) tells us what tool an agent called, but not whether the decision was sound.
New work is emerging to add decision-level attributes — things like:
• Deviation from behavioral baseline
• Permission scope requested vs. granted
• Risk severity of the action
• Confidence and category of the tool use
This “third layer” of observability (infrastructure + protocol + cognitive/decision) is essential for detecting drift in agent behavior before it causes production incidents.
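As a rough illustration of what decision-level telemetry could look like, the sketch below records the attributes listed above for a single tool call and flags decisions that deserve a human look. Everything here is an assumption made for illustration: the `DecisionRecord` type, the `agent.decision.*` attribute keys (these are not established OpenTelemetry semantic conventions), and the review thresholds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    tool: str                    # which tool the agent called
    category: str                # e.g. "read", "write", "deploy"
    confidence: float            # agent's self-reported confidence, 0.0 to 1.0
    risk_severity: str           # "low", "medium", or "high"
    scopes_requested: frozenset  # permissions the agent asked for
    scopes_granted: frozenset    # permissions it was actually given
    baseline_deviation: float    # 0.0 = typical behavior, higher = more unusual

    def to_attributes(self):
        """Flatten into key/value pairs that could be attached to a trace span."""
        return {
            "agent.decision.tool": self.tool,
            "agent.decision.category": self.category,
            "agent.decision.confidence": self.confidence,
            "agent.decision.risk_severity": self.risk_severity,
            "agent.decision.scopes_requested": sorted(self.scopes_requested),
            "agent.decision.scopes_granted": sorted(self.scopes_granted),
            "agent.decision.baseline_deviation": self.baseline_deviation,
        }

def needs_human_review(record, max_deviation=0.5, min_confidence=0.6):
    """Flag decisions that over-reach on scope, are high risk, unusual, or low confidence."""
    over_reach = record.scopes_requested - record.scopes_granted
    return (
        bool(over_reach)
        or record.risk_severity == "high"
        or record.baseline_deviation > max_deviation
        or record.confidence < min_confidence
    )

routine = DecisionRecord("query_prometheus", "read", 0.9, "low",
                         frozenset({"metrics:read"}), frozenset({"metrics:read"}), 0.1)
risky = DecisionRecord("restart_service", "write", 0.8, "medium",
                       frozenset({"metrics:read", "deploy:write"}),
                       frozenset({"metrics:read"}), 0.2)
print(needs_human_review(routine), needs_human_review(risky))  # False True
```

The second call is flagged only because it requested a permission it was not granted, which is exactly the kind of signal that a record of "which tool was called" would miss.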
What This Means for Enterprise Leaders
The era of “set it and forget it” observability is over. As systems become more dynamic and AI agents take on more operational responsibility, organizations that treat observability as a living, evolving discipline will have a significant advantage in speed, cost control, and risk management.
Key priorities emerging from the conference:
• Actively manage observability debt as a first-class technical debt item
• Shift from data-centric to goal-centric interfaces for AI agents
• Prepare for decision-level tracing as agents take on more autonomous work
• Re-evaluate build vs. buy strategies for AI-powered operations capabilities
At Evolving Solutions, our Intelligent Operations practice is focused on exactly these challenges — helping clients modernize their observability posture, reduce mean time to resolution, and safely introduce agentic capabilities into their operations.
The future of observability isn’t more dashboards. It’s systems that can explain what happened, provide the reasoning behind an agent’s decision, and recommend what should be done next, all with minimal human intervention.
I’m looking forward to applying these insights with our clients in the months ahead.
– Your Self-Appointed “AI Wizard” (Rael Rodning)