AI agents invade observability: snake oil or the future of SRE?

The Mooster and friends would like to join your ops team.

Sep 18, 2024

This newsletter was started 5 years ago to explore emerging observability and monitoring startups. In the most boring sense, these companies take operational data and create insights from that data for humans. This always involves a lot of dashboards, alerts, API integrations, and a large monthly bill.

The businesses sometimes take off (Cribl has raised $600 million since it was first mentioned in 2019), and sometimes the technology takes a while to reach maturity (looking at you, eBPF). There’s also the standard IT hype cycle and new use cases for monitoring, like developers building applications using large language models or quasi-trends like data observability. In general: logs, metrics, traces and events go in, useful information for a developer or operator should come out.

As of fall 2024, a growing number of people and large software companies believe this model is about to fundamentally change with advances in AI. This issue of Monitoring Monitoring is about what that could look like for the observability business… if “agentic” generative AI lives up to the hype.

Agents everywhere (even CRM isn’t safe)

The next wave of gen AI is all about agents, and not at all in the “server monitoring agent” or “instrumentation agent” sense (like Dynatrace OneAgent).

This new agent (or “agentic”) focus refers to large language models (LLMs) that can take actions using real-world data with human approval. The latest models from OpenAI, for example, do a great job of solving very complicated programming, math, and science problems using advanced reasoning. But ChatGPT can’t do things like use notes from sales calls and internal customer data to negotiate and close a contract… yet?

A quote from Marc Benioff is useful context at this point, minus his digressions on esoteric Japanese philosophy: “we have to pivot the whole company to agents,” he recently told Fortune. (The massive Salesforce conference in San Francisco, Dreamforce, has now been renamed Agentforce.)

Compared to CRM and fintech, monitoring companies are just starting to arrive at the agentic AI party. The big question: what happens if agentic AI becomes as good as a junior operator or developer at understanding operational data, including connections between different signals and systems?

Meet the newest member of your ops team

There are many startups emerging in the agentic monitoring space, all of which have capabilities that go beyond the chat interfaces that have been released by major vendors to date. Very broadly, they fall into the following categories:

DevOps and/or incident response agents trying to automate away parts of on-call or routine maintenance. The bots you can choose from include Kura (from Kura), OneGrep Bot (from OneGrep) and The Mooster (from Wildmoose). Beeps is also in this space but seems to be pre-launch (and has not revealed a bot name).
“Platforms of agents” or agent toolkits to increase automation. RunWhen has built a solution around agents that can automate multiple types of engineering tasks, and Acorn Labs has developed an open-source toolkit to build generic agents of any type.
Expert site reliability engineer (SRE) agents with domain specific knowledge of cloud or Kubernetes. Parity and Cleric position themselves in this area, although the line between SRE-focused and DevOps agents is blurry. There’s also the open-source project k8sgpt.

The marketing for all of these solutions is about doing more with less, and in some cases moving away from the “co-pilot” or “assistant” language towards something more embodied: Parity and Cleric position their solutions as an actual SRE who joins your team: “We’re building an operator, not another tool.”

More AI snake oil, or something else?

This isn’t the first AI hype cycle in the monitoring space. The various features sold in the past 5-7 years as AIOps, from the practitioner standpoint, were (at best) a mixed success. The machine learning algorithms of the 2010s did not fundamentally change the on-call or incident response workflow. Advanced anomaly detection is helpful but it tends to get turned off (or removed from the day-to-day) if false positives repeatedly wake people up at 3am.

Venture capitalists think this time might be different. a16z, writing about the CRM industry, laid out what it thinks are the platform components of a Salesforce-killer:

[T]he core of the next sales platform could be entirely unstructured and multimodal, including text, image, voice, and video. A company’s sales platform could include data about existing and prospective customers from countless sources… Furthermore, the LLM powering the platform would be constantly ingesting data to create the most up-to-date context.

If we replace “voice and video” (fortunately not a key part of the job in SRE) with operational data, ticketing systems, on-call runbooks, documentation, and source control, you get a marketecture that looks identical to the solutions many of these startups are building.

This is different from the 2010s historical model of monitoring startups that started by creating a new type of database inspired by an internal system learned at a large technology company. Instead of web scale, the secret sauce is connecting many more sources of real time, operational, and internal data that are brought together via proprietary LLM glue.

That’s the concept at least, and if it works it would change the way practitioners do observability, monitoring and incident response.

Benchmarks and murder mysteries

If every major APM vendor and dozens of startups release agents in the next year, it will be difficult for customers to tell what’s snake oil and what’s actually useful. One approach, also seen in the financial space, is having open benchmarks for assessing how well agents can answer questions and show domain-specific knowledge.

In the past week, Parity released the first known benchmark for Kubernetes or cloud called SREBench—you race their agent in a simulated cluster to see who can diagnose the root cause faster. They built the benchmark, according to their blog post, by integrating concepts from the closest existing benchmark they found that could be applied to modern-day SRE: solving murder mysteries.

While there are many different types of AI benchmarks available, more domain-specific ones for operational and incident response tasks are welcome and needed. Benchmarks are far from perfect, but could help future customers of agents have a starting point for measuring how effective an agent is at solving problems in a simulated environment. Otherwise, like the previous AIOps wave, we’re relying mostly on analyst reports and white papers provided by eager sales teams on how much your mean time to resolution (MTTR) will decrease.

Dollar signs and dilemmas

We’re in the very early days of understanding the impact of LLM-based agents on the observability space. Besides the questions around how effective these agents will be in complex operational environments, there’s an equal number of data privacy and regulatory questions that have yet to be resolved. Are you (or EU regulators) willing to give The Mooster your transaction logs with potential PII data? There’s also the open question of how to, well, monitor these agents for compliance and safety.

Pricing is another concern. If effective SRE agents require large amounts of operational data and lots of NVIDIA GPUs to do their job: add a few more zeros to the bill from your favorite APM vendor.

If you’re a venture capitalist, you see dollar signs. If you work in the monitoring space, you might see the end of your job. If you’re a customer of a monitoring vendor, you’re probably just tired of hearing about AI.

As it’s Agentforce in San Francisco this week, last word goes to Benioff from his Fortune interview.

“It is about driving through the innovator’s dilemma.”

🚗🚗🚗

Subscribe to the newsletter for updates to see how this evolves in the next few months.

If you read this thinking it was going to be about LLM monitoring, check out this post from early 2023: Large Language Model Observability.

Large Language Model Observability

Startups leveraging LLMs for monitoring, testing, and incident response

Clay Smith

Apr 04, 2023

Welcome back to Monitoring Monitoring. This issue highlights companies focused on solving observability challenges for teams building applications using Large Language Models (LLMs) like OpenAI’s GPT-4.

The current (April 2023) technology hype cycle around LLMs is unlike anything in recent history. It’s clear something is happening. Last Friday, around 5,000 people showed up for the San Francisco Open Source AI meetup — roughly how many people showed up in person to KubeCon North America last year. At least one person compared the AI meetup to Woodstock.

Let’s take a look at some startups focused on solving the problems of the many developers and companies rushing into the space to build new applications on top of LLMs:

LLM Observability

Developers building LLM-based apps aren’t necessarily writing a lot of code. They’re writing prompts in English sent to an API (“AI as a service”). Here’s an example of an input with OpenAI’s GPT-3:

Input prompt: Write a tweet about how a large language model takes an input and produces an output. Put the tweet in square brackets.

Output from API: [Large language models use advanced AI techniques to analyze and comprehend an input, and generate a response that reflects the learned patterns and relationships within the input sequence. #AI #LanguageModels]

This actual “low code” technique makes some things easier but other things harder. For example, how do you track the quality of prompts you are writing? How do you optimize the prompts for fast responses, or choose the best LLM model? What about cost optimization, since it can get fairly expensive?

Y Combinator-backed Helicone helps teams answer those questions by intercepting their OpenAI API calls with a single line of code and summarizing their requests in a nice dashboard.

Vellum, another Winter 2023 Y Combinator startup, is building more of a general-purpose developer platform for developing, monitoring, and fine-tuning LLM applications. As Vellum says in their docs, the critical piece is capturing all the inputs and outputs from a developer’s application to the LLM: “Every model input, output, and end-user feedback is captured and made visible at both the row-level and in aggregate.”
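To make the "capture every input and output" idea concrete, here is a minimal sketch of a wrapper that records each prompt, response, latency, and a rough token count. It is not how Helicone or Vellum work internally; `LLMCallLog`, `LLMCallRecord`, and `fake_llm` are hypothetical names, and the word-count "tokens" are a crude stand-in for a real tokenizer.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LLMCallRecord:
    prompt: str
    output: str
    latency_s: float
    approx_tokens: int
    feedback: Optional[int] = None  # e.g. +1 / -1 from an end user


@dataclass
class LLMCallLog:
    records: List[LLMCallRecord] = field(default_factory=list)

    def call(self, llm: Callable[[str], str], prompt: str) -> str:
        """Invoke the model and record what went in, what came out, and how long it took."""
        start = time.perf_counter()
        output = llm(prompt)
        latency = time.perf_counter() - start
        # Assumption: whitespace word count approximates token usage.
        approx_tokens = len(prompt.split()) + len(output.split())
        self.records.append(LLMCallRecord(prompt, output, latency, approx_tokens))
        return output

    def summary(self) -> dict:
        """Aggregate view: the kind of numbers a dashboard would chart."""
        n = len(self.records)
        return {
            "calls": n,
            "total_tokens": sum(r.approx_tokens for r in self.records),
            "avg_latency_s": sum(r.latency_s for r in self.records) / n if n else 0.0,
        }


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a hosted model; returns a canned reply."""
    return "[canned reply]"


log = LLMCallLog()
log.call(fake_llm, "Write a tweet about large language models.")
print(log.summary())
```

Swapping `fake_llm` for a real client call would keep the same shape: the row-level records answer "what did this prompt produce?", and the summary answers "how much is this costing us?".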

If these “LLMOps” (sorry) companies are successful in helping their customers build amazing new services using this technology, how could that change monitoring, testing, and DevOps?

Generative Testing and Incident Response

The most well-known AI model used in development and technical workflows to date is GitHub’s Copilot (“Your AI Pair Programmer”). Startups are beginning to explore what it looks like when the same underlying technology can automate more than just code. Traceloop, for example, says on their website they can use the combination of distributed traces and generative AI to automatically generate tests and improve reliability.

Wild Moose ambitiously takes on the incident response space with a Slack chatbot that summarizes the entire state of your system in response to questions during critical incidents… and then summarizes what happened in a nice executive summary afterwards. As they say on their site, “Unlike humans, our AI is happy to wake up at 2 AM.”

Screen capture from Wild Moose’s Slackbot (April ‘23)

Generative Testing and Incident Response

A 2020 issue of this newsletter linked to a well-known PowerPoint from Princeton called How to recognize AI snake oil. The presentation made a (still valid) point that the AI hype cycle, despite some legitimate technical breakthroughs in the space, more often leads to “fundamentally dubious” marketing claims about capabilities of various products or solutions.

We’re almost certainly going to see many, many more of these startups focused on solving problems in DevOps, monitoring, observability… or just figuring out how to interact with Kubernetes in plain English.

Will we be able to get a quick and relevant summary of what’s going on after an on-call page at 2am? Will an APM provider’s chatbot ask you out on a date after sharing some latency metrics?

Subscribe to the newsletter for updates to follow what startups are building or follow on Twitter.

Disclosure: Opinions strictly my own and not my employer’s. I am not a consultant to, employed by, or an investor in any of the companies mentioned. There are no paid placements, sponsorships, or advertisements in this newsletter. This post was not written by an LLM, but I did have it generate some potential titles that I rejected. It also suggested I rename the newsletter to “Observing Observability”…

The Continuous Cloud Cost Optimization Business

Don't tell the finance team... or do?

Clay Smith

Jan 12, 2022

In a recent article, The Economist explored a relevant question concerning the “cumulus of data centres” (the cloud) that “makes geeks drool” (apparently):

If [businesses] entrust all their data—the lifeblood of the digital economy—to an oligopoly of cloud providers, what control do they have over their costs?

With development teams creating Kubernetes clusters everywhere from servers to first-person shooters and your 4x4, the consensus is that keeping cloud costs under control is getting harder, not easier. Bill reduction recommendations from a consultant or spreadsheet have a short shelf life.

Software that helps reduce your cloud bill is not new, but this newsletter focuses on startups that promise to continually optimize the cost of cloud-native workloads using monitoring, machine learning, and clever technical tricks.

Start your cost optimization engine

The technical core of all of these software solutions is something—like a dashboard, report, recommendation engine or monitoring agent—that makes or suggests an adjustment to reduce cost while balancing performance.

Stormforge takes an experimental approach: it scans clusters, optimization goals are set, then load tests are run in a test environment to find the ideal set of parameters for Kubernetes workloads. With some machine learning, you get a nice dashboard that shows the most cost-effective solution, which can be exported to developer-friendly Kubernetes YAML.
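The experimental approach can be sketched as a search over candidate configurations, keeping the cheapest one that still meets a performance goal. This is a toy illustration, not Stormforge's method: the prices, the `simulated_p95_latency_ms` model (standing in for a real load test), and the candidate grid are all made-up assumptions.

```python
from itertools import product

# Hypothetical hourly prices per unit of resource (not real cloud pricing).
CPU_PRICE = 0.04   # per vCPU-hour
MEM_PRICE = 0.005  # per GiB-hour


def simulated_p95_latency_ms(cpu: float, mem_gib: float, replicas: int) -> float:
    """Toy model standing in for the result of a real load test."""
    return 1200.0 / (cpu * replicas) + 200.0 / mem_gib


def hourly_cost(cpu: float, mem_gib: float, replicas: int) -> float:
    return replicas * (cpu * CPU_PRICE + mem_gib * MEM_PRICE)


def cheapest_config(latency_goal_ms: float):
    """Try every candidate; return (cost, cpu, mem_gib, replicas) or None."""
    candidates = product([0.5, 1.0, 2.0], [1, 2, 4], [2, 4, 8])
    passing = [
        (hourly_cost(c, m, r), c, m, r)
        for c, m, r in candidates
        if simulated_p95_latency_ms(c, m, r) <= latency_goal_ms
    ]
    return min(passing) if passing else None


print(cheapest_config(latency_goal_ms=400))
```

A real product would replace the exhaustive grid with something smarter (the "machine learning" part), since each evaluation is an actual load test rather than a cheap function call.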

Opsani’s approach works similarly to many monitoring and observability solutions: they deploy a component directly in a production cluster that uses metrics, forwarded to a machine learning engine in their SaaS, to start identifying optimal configuration scenarios. While Opsani says infrastructure adjustment (like scaling up or down) can be done automatically using their product, cost-savings recommendations are also surfaced in their product for human intervention.

Taking a batteries-included approach to Kubernetes cost optimization, cast.ai offers a distribution of Kubernetes designed for lower costs and ease of management by selecting the best cloud instance type and scaling policy and by identifying inefficiencies.

Cost savings with novel technical tricks

If you look at their marketing page, Exotanium talks about cloud cost savings and optimization, but also mentions some “secret sauce from Cornell University”.

The technology is explored more in a 2020 article at The New Stack and a paper that describes a new kind of virtualization layer that enables workloads to be moved between computers without stopping, something that is not currently possible on any major cloud provider. Exotanium’s insight is that if running workloads are suddenly portable across regions, clouds, and instance types using their tech, then they can be quickly moved to other (i.e. cheaper) locations for cost-savings. It’s clever.

Like APM, but the finance team loves it

San Francisco-based KubeCost combines a cost-optimization approach with governance features inspired by Google internal tools. Recommendations and cost monitoring are combined with dashboards, reports, and real-time alerts to help teams make better decisions.

The KubeCost CEO, Webb Brown, put it well in a Medium post announcing their funding:

Solving the problem of runaway Kubernetes spend—and empowering developer teams to manage these costs—starts with giving DevOps and engineering teams visibility into the black box that is Kubernetes spend.

Where have we heard this before?

As with the recent acquisition spree of logging startups by security companies covered in the previous newsletter, the messaging might seem familiar to anyone who has been following the monitoring space. To recap:

It’s gotten too hard for technical teams to manage
Built-in or free tools aren’t working
Better, real-time data is needed for visibility
Alerts, dashboards, machine learning, and recommendations help teams continually optimize and make better operational decisions

Application monitoring data, first used at scale by engineering and ops teams addressing reliability or performance, is in the process of learning a new trick when combined with some cost information and machine learning.

And the finance team will love it.

Subscribe to get the next issue in your inbox or follow on Twitter.

Where are all those monitoring startups now?

Hint: Acquired or they raised a $200m Series C

Clay Smith

Oct 06, 2021

Welcome back. I’m restarting this newsletter after a long break. Since the last post there have been dozens of acquisitions, fundraising rounds, emerging standards, new buzzwords, interesting trends, more complaining about Kubernetes, and analyst firms still trying really hard to make AIOps a thing.

So... what happened to the startups you never heard of in previous issues?

In the first post of this newsletter, I profiled a small number of monitoring-related companies that appeared in the startup sponsorship tier at KubeCon 2019. Here’s what happened to the dozen originally-profiled scrappy upstarts in the first issue, two years later:

Five were acquired
Two raised $200m+ Series C rounds in mid-2021
One raised a $40m Series B round in early 2021
… and everyone is still in business

Chronosphere raised a Series B and became generally available, Grafana Labs raised a big Series C, and Epsagon and Sensu were absorbed into larger software companies. The other companies are writing new chapters in the previously-covered observability pipeline and eBPF posts. Several also got expensive-looking logo redesigns. Updates follow.

Observability pipelines redirect to /dev/money

It’s not just Cribl’s $210m Series C round. If we agree with the various people who claim to have first said that “data is the new oil”, who’s going to provide the specialized infrastructure to efficiently move around all of those logs and metrics, particularly in this era of egregious cloud egress costs and large Splunk bills?

Defined in the first newsletter as “a workflow for filtering and routing operational data”, observability pipelines now seem to be appearing in an enterprise software architecture PowerPoint deck near you.

In February 2021, DataDog acquired Timber Technologies, the company that built an observability pipeline called Vector. The press release explains why:

With the addition of Vector, we will be able to give our customers even more control over how their observability data is ingested, enriched, stored, and routed, so they can build fully capable, cost efficient data pipelines in both cloud and on-premise environments.
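As a rough illustration of "filtering and routing operational data", here is a minimal pipeline sketch with filter, enrich, and route stages. It is not Vector's or Cribl's configuration model; the event shape, the `drop_debug`/`enrich`/`route` functions, and the sink names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]


def drop_debug(events: Iterable[Event]) -> Iterator[Event]:
    """Filter: discard low-value debug logs before they cost money downstream."""
    return (e for e in events if e.get("level") != "debug")


def enrich(events: Iterable[Event], env: str) -> Iterator[Event]:
    """Enrich: stamp each event with deployment context."""
    return ({**e, "env": env} for e in events)


def route(events: Iterable[Event], sinks: Dict[str, List[Event]]) -> None:
    """Route: errors go to the (expensive) indexed store, the rest to cheap archive."""
    for e in events:
        target = "indexed" if e.get("level") == "error" else "archive"
        sinks[target].append(e)


raw = [
    {"level": "debug", "msg": "cache miss"},
    {"level": "info", "msg": "request served"},
    {"level": "error", "msg": "upstream timeout"},
]
sinks: Dict[str, List[Event]] = {"indexed": [], "archive": []}
route(enrich(drop_debug(raw), env="prod"), sinks)
print({name: len(events) for name, events in sinks.items()})
```

The cost argument is visible even in this toy: one of three events never leaves the pipeline, and only one lands in the expensive destination.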

From posts on Confluent’s blog to an open-source project in the CNCF called Tremor that (wonderfully) calls itself “a kind of sophisticated /dev/null device with a few fancy knobs attached”, it seems likely we’ll be hearing more about this in 2022.

eBPF startups meet the Corporate Development team

eBPF was profiled in this newsletter as a deeply nerdy and technical emerging technology in the monitoring space that had a lot of promise. In hindsight, 2019 was a great year to be an engineer with deep expertise in eBPF and a check from a venture capitalist.

There’s still a lot of excitement and a growing community around eBPF, but take a look at what’s happened:

Flowmill gets acquired by Splunk
Pixie Labs gets acquired by New Relic
Cmd gets acquired by Elastic

Flowmill, now part of Splunk, has open-sourced most of its technology and gave an early look at how it integrates with OpenTelemetry at the most recent eBPF Summit.

As noted in Elastic’s acquisition of Cmd, some of the recent excitement around eBPF is also driven by security use-cases, specifically enabling customers “to detect, prevent, and respond to attacks on their cloud workloads”.

If that quote sounds familiar, you may have also read the press releases about acquisitions of some logging startups.

AppSec detects logging startups

Quiz time. Name the security vendor in their press release announcing the acquisition of a logging startup:

____(A)____ will further expand its eXtended Detection and Response (XDR) capabilities by ingesting and correlating data from any log, application or feed to deliver actionable insights and real-time protection.

____(B)____ will be able to ingest, correlate, search, and action data from any source, delivering the industry’s most advanced integrated XDR platform for realtime threat mitigation across the enterprise and cloud.

Answers: A) CrowdStrike acquires Humio on 2/18/21 and B) SentinelOne acquires Scalyr on 2/9/21.

The general theme, as articulated in the CrowdStrike press release, is a “unified data layer that powers the next generation of enterprise security and IT operations”. These acquisitions are a reminder that the line (and IT budget) between pure operational logging and application security solutions is getting blurrier.

What’s in the next newsletters?

In the next few issues, this newsletter is going to explore late 2021’s emerging themes in the monitoring and observability startup space. Next issue will be all about the startups that want to help you optimize costs... for a price.

Subscribe to get the next issue in your inbox or follow on Twitter.

Why the long newsletter hiatus? I left consulting in mid-2020 and joined a startup in the observability space... then it got acquired by a much larger software company in mid-2021.

Disclosure: Opinions my own and not my employer’s. I am not a consultant to, employed by, or an investor in any of the companies mentioned. There are no paid placements, sponsorships, or advertisements in this newsletter.

Service Meshes, Kiali, and Continuous Verification

What happens when observability data meets configuration management and deploys?

Clay Smith

Apr 16, 2020

This is Monitoring Monitoring, a quasi-monthly email newsletter about early-stage startups and projects in the observability space. Subscribe here.

There are good reminders that Kubernetes might not be the solution to all of our cloud software problems. In a recent tweet, Jaana Dogan compared Kubernetes to one of the deeper layers (L4) of the OSI model: container orchestration might be a foundational part of cloud native software, but it is also something that many developers and operations teams can avoid interacting with—if they pay someone else to deal with it.

So what’s a friendlier application-focused abstraction if you don’t want to get lost in the deep weeds of Kubernetes and you’re not on the serverless bandwagon? In early 2020, your answer might be some kind of service mesh.

A service mesh manages how microservices (macroservices?) communicate with each other, including what happens when new software gets deployed. There are good security and networking use cases, but the focus of this newsletter this week is what happens when the service mesh meets monitoring.

Service mesh-palooza

The open-source ecosystem around service meshes is thriving. Istio, linkerd, and Envoy have spawned a number of paid and free solutions. Both AWS (AWS App Mesh) and Azure (Azure Service Fabric Mesh) have based their managed solutions on Envoy. Google Cloud (Traffic Director) has a solution based on Istio. There is also a fascinating open-source project tailored for financial institutions called SOFAStack that seems to be powering (parts of) large banks in China.

In terms of enterprise SaaS meshes, there’s also Aspen Mesh (incubated by F5, based on Istio), Tetrate (based on Istio and Envoy), and Banzai Cloud’s Backyards.

When it comes to monitoring, the interesting aspect of service meshes is that they provide observability data that was previously difficult to collect as a built-in feature for any kind of service running in the mesh. Traditional metrics (like request rate) and traces are labeled with the context of the workloads generating that request, cool visualizations included.
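A small sketch shows why workload labels matter: once every request carries a source and destination, per-edge request counts and error rates fall out of a simple aggregation. The record format below is a made-up assumption, not the telemetry schema of Istio, Envoy, or any other mesh.

```python
from collections import defaultdict

# Hypothetical request records as a mesh proxy might report them.
requests = [
    {"source": "frontend", "destination": "cart", "status": 200},
    {"source": "frontend", "destination": "cart", "status": 503},
    {"source": "cart", "destination": "payments", "status": 200},
    {"source": "frontend", "destination": "cart", "status": 200},
]


def edge_stats(records):
    """Aggregate request and error counts per (source, destination) edge."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0})
    for r in records:
        edge = (r["source"], r["destination"])
        totals[edge]["requests"] += 1
        if r["status"] >= 500:
            totals[edge]["errors"] += 1
    return {
        edge: {**s, "error_rate": s["errors"] / s["requests"]}
        for edge, s in totals.items()
    }


stats = edge_stats(requests)
for edge, s in stats.items():
    print(edge, s)
```

The per-edge dictionary is essentially the data behind a service graph visualization: nodes are workloads, edges carry traffic and error rates.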

Enter Kiali, a “service mesh meets observability” project for Istio that seems to be supported by Red Hat/IBM. The interesting twist with Kiali is in the tagline: “Service mesh observability and configuration”.

Configuration meets observability

There are several compelling ideas in Kiali, including emerging methods for correlating observability data. As Kiali—and most major APM/logging vendors—provide solutions that combine metrics and traces, there are different technical approaches involved in combining “the three pillars of observability” in a single interface to enable easy troubleshooting. See this technical talk from Chronosphere cofounder Rob Skillington on deep-linking metrics and traces for how this is done.

Another concept in Kiali is how it uses its privileged position in your infrastructure—it sees all service-to-service communication in the mesh—to help teams manage complex configuration using two different approaches:

Validations: Automatically identify configuration mistakes at runtime.
Wizards: User interfaces that help teams modify service mesh configuration rules without making changes to complex files by hand.
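To give a feel for what a validation might check, here is a minimal sketch that flags two common mistakes in traffic-splitting rules: weights that do not sum to 100 and destinations that do not exist. The rule format is a simplified, hypothetical shape and not Istio's actual resource schema; `validate_routes` is an illustrative name, not Kiali's implementation.

```python
def validate_routes(known_services, routes):
    """Return a list of human-readable problems found in routing rules."""
    problems = []
    for route in routes:
        total = sum(dest["weight"] for dest in route["destinations"])
        if total != 100:
            problems.append(f"{route['name']}: weights sum to {total}, expected 100")
        for dest in route["destinations"]:
            if dest["service"] not in known_services:
                problems.append(f"{route['name']}: unknown service '{dest['service']}'")
    return problems


# Services the mesh currently knows about (assumed to come from runtime state).
known = {"cart-v1", "cart-v2"}
routes = [
    {
        "name": "cart-split",
        "destinations": [
            {"service": "cart-v1", "weight": 90},
            {"service": "cart-v3", "weight": 20},  # typo: v3 does not exist
        ],
    },
]

problems = validate_routes(known, routes)
for p in problems:
    print(p)
```

The key ingredient is the `known` set: a tool that already observes the running mesh can check configuration against what actually exists, which a static linter cannot.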

The general idea is if problems are detected, it’s possible to fix them immediately inside Kiali. Metrics are combined with awareness of runtime configuration to answer questions and fix problems in the same tool with human intervention. It’s a cool idea (certified dope by Kelsey Hightower), and a new concept that integrates significant configuration changes directly inside an observability project.

Deploys meet observability

As with diagnosing a problem caused by configuration, observability data has always been central in determining whether a deploy was successful. A popular solution is Kubernetes-based continuous deployment software called Spinnaker that integrates with the usual monitoring vendors. Spinnaker is also available as enterprise-focused SaaS from Armory, OpsMx, and Mirantis.

Verica, founded by Casey Rosenthal, author of the recently published Chaos Engineering book, extends this idea further with continuous verification (CV). CV is a method of proactively identifying issues in a complex system using an automated system that verifies assumptions about a service. It is effectively what happens when you fully automate well-designed chaos engineering experiments, as Netflix did with their Chaos Automation Platform (ChAP).

In a 2019 blog post and conference talk, Casey argues that CV techniques are the evolution of what teams have learned with continuous integration and delivery.

Continuous verification features also seem to be publicly available from the continuous-delivery-as-a-service startup Harness. Harness achieves this via integration with APM and logging vendors with some machine learning sprinkled on top.
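In its simplest form, using observability data to verify a deploy can be sketched as a comparison between a canary and the existing baseline. This is an illustrative assumption, not how Harness, Spinnaker, or Verica implement it; the thresholds and the `verify_deploy` function are hypothetical.

```python
def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0


def verify_deploy(baseline, canary, max_increase=0.01, min_requests=100):
    """Return 'promote', 'rollback', or 'wait' for a canary deploy.

    baseline and canary are dicts with 'requests' and 'errors' counts,
    assumed to come from a metrics backend for the same time window.
    """
    if canary["requests"] < min_requests:
        return "wait"  # not enough traffic to judge yet
    delta = error_rate(canary["errors"], canary["requests"]) - error_rate(
        baseline["errors"], baseline["requests"]
    )
    return "rollback" if delta > max_increase else "promote"


baseline = {"requests": 1000, "errors": 5}
print(verify_deploy(baseline, {"requests": 200, "errors": 10}))
print(verify_deploy(baseline, {"requests": 200, "errors": 1}))
print(verify_deploy(baseline, {"requests": 50, "errors": 0}))
```

Real systems compare many metrics at once and use statistical tests rather than a fixed threshold, which is presumably where the "machine learning sprinkled on top" comes in.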

The promising future of using observability data in new and clever ways

There’s a lot of attention on startups focused on the core problems of observability right now—different technical approaches that process, store, instrument, collect, and visualize data. Many of these emerging techniques have been covered in this newsletter, from eBPF and Prometheus databases, to observability pipelines.

However, some of the most compelling startups and projects right now are exploring what happens when you use observability data in clever ways to solve problems facing technical teams as they ship software, even if they decide they don’t need to deal with Kubernetes. New ways to safely deploy software, proactively identify issues in a complex system, or detect configuration errors are just a few possibilities. There will be many more.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed by, consulting for, or an investor in any of the mentioned companies or their competitors.
