Chaos, Complexity and Catalogs

Startups helping operate and understand complex systems through chaos engineering or service catalogs

Welcome to all the new subscribers who like databases. This is Monitoring Monitoring, a quasi-monthly newsletter about startups in the observability space. You can subscribe here.

Monitoring is most appreciated when something breaks. Despite a lot of marketing (and snake oil) around AI, it’s still up to human operators to find a fix.

The lessons from a physician in a resilience engineering cult favorite paper from the nineties still resonate with people building software today. Consider these two points:

12. Human practitioners are the adaptable element of complex systems. 

18. Failure free operations require experience with failure.

The messy intersection of people, organizations, and complex (always failing) software is the focus of this newsletter. If we use monitoring, observability and dashboards to answer the question “what’s broken?”, here are two more questions:  

How do we break it? (#18) and Who owns it? (#12) 

Chaos as a Service, or “How do we break it?”

In 2011, Netflix had an idea to test a system’s ability to tolerate failures by deliberately injecting failures using a tool called Chaos Monkey. This was extremely influential, and nine years later, resilience and chaos engineering are about half of the sessions at SREcon in March and there’s now the equivalent of an Agile Manifesto for Chaos.

Gremlin (founded 2016) and ChaosIQ (founded 2017) are selling SaaS solutions to run chaos experiments and there is a long list of open-source tools in the space. My favorite project name is Bloomberg’s PowerfulSeal.

Gremlin and ChaosIQ have several integrations to run chaos experiments in different parts of the stack (like the network, cloud provider, or application). The idea, like Chaos Monkey, is you can deliberately create failure scenarios and learn what happened to build more resilient systems.

There are a couple different ways chaos solutions fit into monitoring (other using logs and dashboard to figure out what broke during an experiment). Gremlin has support to overlay chaos events on top of Datadog dashboards, while ChaosIQ has integrations for Prometheus and OpenTracing. The OpenTracing integration gets at a cool idea where running an experiment changes the observability of your system: when you start the experiment, detailed traces can be automatically collected.

Before you run chaos experiments, however, it’s a good idea to know what your services actually do and who owns them.

Actually Your Problem as a Service, or “Who owns it?”

There’s a very human problem related when an organization enters the late-stage microservice phase: it’s no longer clear from dashboards or internal documentation who owns a service or what a given service does—especially if the original owner was promoted (or left the company). Service naming can complicate understanding (see AWS Systems Manager Session Manager) and service metadata, in any case, are usually spread across clusters, data centers, and dashboards. There’s probably something clever to say about Conway’s Law here.

Enter OpsLevel, effx, Dive and GitCatalog. The key words on their marketing pages are “owned”, “tame”, “structured”, and “up-to-date”. Most provide integrations into different types of systems (clusters, CI/CD, alerting, messaging) so the service catalog updates automatically and doesn’t get stale on an outdated internal wiki page. According to Joey Parsons, founder of effx, the magic number where you start to feel human-to-services pain is around 20.

Detailed metrics and logs might be available for all your services, but these startups remind us that that doesn’t matter much if you can’t find the right person or documentation to interpret the data. 

More systems, more human problems

The work these startups are doing is a humbling reminder that regardless of really interesting emerging technology in the observability space, there’s still a massive amount of work to be done to connect the right people with systems and help them deal with (and anticipate) failure scenarios. 

The intersection of academics, developers, SREs and operators tackling these problems is one of the most interesting interdisciplinary groups thinking about what happens when humans meet software complexity. There is a very long list of hard problems related to understanding and operating the systems we build: “who owns it?” and “how do we break it?” are just two that involve a fast-growing number of companies, consultancies, and engineers.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

See Also

Big Prometheus

Thanos, Cortex, M3DB and VictoriaMetrics at scale

A recent paper from Google that will probably have an impact on the future of large-scale data processing summarizes the situation facing many companies in early 2020:

“Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in a complex, expensive, and harder to maintain infrastructure.”

There are a number of interesting projects and startups addressing this problem in the open-source monitoring world, particularly for teams with dozens or hundreds of Kubernetes clusters using Prometheus for monitoring.

Prometheus, usually combined with Grafana for dashboards and reporting, has become the de facto standard for open-source monitoring in Kubernetes and one of the Cloud Native Computing Foundation’s most popular projects. (The Apache Foundation also has an open-source APM project called Skywalking that I’m told is big in China.)

A common configuration is to run a single Prometheus server per Kubernetes cluster, which is getting easier to setup and run with projects like Prometheus Operator. With dozens to hundreds of clusters in an organization—Cloudflare was running 188 in 2017—it’s complicated to maintain and slow to globally query so many Prometheuses (Prometheii?). 

There are four major open-source database projects written in Go working on this problem: Thanos, Cortex, M3DB, and VictoriaMetrics. Their goals are similar:

  • Prometheus compatibility (integrate easily with existing Kubernetes clusters)

  • Global, fast queries (see all the data from everywhere quickly)

  • Long-term historical metrics (policies to store long-term metrics cheaply)

  • High availability (resiliency against crashes without data loss)

Here’s a brief overview:

Thanos and Cortex

Thanos and Cortex are similar enough in terms of the problems they aim to solve that there was a talk at PromCom 2019 (the Prometheus conference) from Tom Wilkie with the delightful title “Two Households, Both Alike in Dignity: Cortex and Thanos”

Cortex behaves like typical SaaS monitoring where data is pushed to a central location from remote servers (but using native Prometheus APIs). Thanos is less-centralized and data remains within each Prometheus server—which is how Prometheus operates by default. These two different approaches result in different technical approaches for answering global queries like “which of my clusters is on fire?”. Tom Wilkie’s talk has more details.

The other interesting architectural aspect of both projects is they both can leverage cloud-managed services to store long-term historical data (“what was the error rate seven weeks ago?”) in order to lower operational cost.

Thanos can use cheap-ish object-stores like Amazon S3 or Azure Blob Storage, while Cortex prefers more expensive NoSQL stores like AWS DynamoDB or Google BigTable. (Cortex is now adding support for object-stores as well.) This lets teams adopting either project make deliberate choices around cost and performance for historical metrics.

It seems like there will be more collaboration between the projects in the future, and there’s a juicy twitter hot take on it.


Uber wrote their own thing, too

M3DB follows a Thanos-like model, but according to a post from the creator on Hacker News, the main issue in adopting Thanos was that Uber frequently needed historical metrics that were too slow to fetch from an object-store like Amazon S3 and there were massive bandwidth costs when historical data was moving between the cloud and Uber’s on-prem data centers.

There’s now a startup called Chronosphere that has spun off from this work and they’ve raised an $11 million Series-A round from Greylock Partners. According to TechCrunch, it will will take over management of the M3 project going forward.


Like M3DB, Thanos, and Cortex, the creators of VictoriaMetrics also faced a cost and performance problem. They were inspired by ideas from ClickHouse, a cool kid database created by Yandex that people publish extremely impressive performance and cost metrics for when they run it on a three-year old Intel hardware.

The VictoriaMetrics team, according to their FAQ, is prioritizing low operational cost and good developer/operator experience. There’s even a way to run it in a single node configuration. The performance numbers look impressive, and it seems like they also sell (or plan to sell) some kind of managed cloud service.

Coming soon to a startup near you?

At this point, you mind be asking, “why bother? I’d rather pay [SaaS vendor] and not deal with any of this.”

The interesting thing is [SaaS vendor] may very well be using one of these projects to power the oberservability solution they’re selling. Cortex, notably, is being used by Weave Cloud, Grafana Cloud, and a new enterprise service mesh called AspenMesh. Thanos was developed at (and is presumably still being used by) the gaming software company Improbable and M3DB, of course, is being used at Uber. 

Another exciting thing about these projects is that they offer a front-row seat to talented engineers solving hard distributed systems problems related to large-scale monitoring and making different decisions along the way. All this activity suggests a healthy open-source ecosystem that’s supporting a number of observability startups and providing solutions for other companies that need to run some kind of Big Prometheus. There are more high-quality open-source choices than ever.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Addition (since there are no comments on Substack): People in the know say you should think very, very carefully before deciding you need to run one of these solutions.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

eBPF: A new BFF for Observability Startups

The newsletter this week continues looking at interesting products monitoring startups are building. After a shallow dive into observability pipelines, let’s consider eBPF: the most technical and nerdy topic in the entire monitoring space right now.

eBPF, short for extended Berkeley Packet Filters, is a Linux subsystem that has seen significant adoption in systems performance engineering and software-defined networking tools. Netflix’s Brendan Gregg, who just published a book about eBPF, has called it a “Linux superpower”.

For a small number of startups with very technical founders, eBPF is being used for flow monitoring, a type of network traffic analysis. Network data collected using eBPF, particularly when combined with metadata from Kubernetes or service meshes, opens up some interesting use cases and hints at bigger things to come—maybe.

eBPF vs Traditional Instrumentation 

If you’re not a Linux kernel hacker, there’s something fundamental about eBPF you should understand, and that’s related to how eBPF collects observability data.  

If you were going to build some kind of application monitoring solution in the past two decades, the general approach has been to write a special-sauce library (“a monitoring agent”) that runs alongside your code and instruments different events to generate metrics. It’s hard to do because of a massive number of edge cases (see here), but at a high level looks something like this:

The eBPF approach is fundamentally different: Linux engineers write programs that run safely inside the kernel to hook into and measure nearly any kind of system event. Networking-related events, like connections opening or closing, are especially well-supported. Highly (highly!) simplified, it looks like this:

With eBPF, interesting things that happen in code or in the kernel that you want to measure—like how many connections are made to an IP address—can now be captured at a lower level without special application libraries or code changes. If well-written, eBPF programs are also very performant. Facebook, apparently, now has around 40 eBPF programs running on every server.

As of late 2019, six to seven startups seem to be using eBPF for some kind of observability offering (and discussing it publicly): Flowmill, Sysdig, Weaveworks’ Weave Scope, ntop’s ntopng, Cillium’s Hubble, Datadog and Instana*. There is also an open-source Prometheus exporter from Cloudflare.

The main thing all the paid products have in common is they seem to be good at building detailed network flow maps and capturing related metrics:

Since eBPF works on any new-ish kernel version, there is also a possibility of building a more complete map of the things that do things in your cloud/datacenter, not just the latest stuff you have running in Kubernetes or services that you hand-instrumented in Rust.

Flowmill’s CEO Jonathan Perry described three broad use cases for this kind of monitoring during a recent talk

  1. Software architecture (how everything fits together with neat visualizations)

  2. Service health (figuring out what’s not working via networking metrics)

  3. Data transfer cost optimization (see Corey Quinn’s tweet on why this is so painful in the cloud)

It’s a solid list and expands on well-known eBPF security and performance tuning use-cases.

What’s next?

All of the work done in eBPF so far opens up an interesting question: eventually, will it replace most forms of agent-based instrumentation with magic bytecode that runs in the kernel and understands everything your system, applications, network and services are doing at an extremely low level that results in amazing visualizations and operational insights at hyper-cloud scale?

Not yet. But people are probably working on it. To nerd out on how it might happen, take a look at user-level dynamic tracing (uprobes) or user statically-defined tracing (USDT). Both are methods to hook into application-level events. There are some really interesting blog posts and examples and cool proof-of-concept code, like a BPF program that spies on everything you type into the shell.

There’s still a lot of superhero-level systems engineering that needs to be done. Until someone figures that out, eBPF solutions are exciting to watch for security use cases and network or flow monitoring—especially if you’re experimenting with service meshes or running lots of clusters.

That’s it for this newsletter. If you enjoyed reading, feel free to share using the link below or subscribe.

*Correction: The email edition of this newsletter left out Instana and Datadog. Instana seems to be using some eBPF functionality in technical preview according to their docs and Datadog is using a fork of Weaveworks’ TCP tracer and is using it their month-old Network monitoring solution.

Newsletter disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Appendix: Deep Dives on eBPF Monitoring from different startup perspectives

Observability Pipelines at Kubecon

I learned about a new type of marketecture, and have a feeling it’s not going away anytime soon. It’s called the observability pipeline.

The term appears multiple times on the marketing pages of companies who exhibited in the start-up tier at Kubecon last week. This tier is the lowest-cost sponsorship and thought it’d be an interesting filter to learn about new (or thrifty) software companies. Out of the 74 startup exhibitors:

  • 10 sell something involving data or storage

  • 8 sell something involving security

  • 12 sell something involving Kubernetes cluster management, especially multi-cluster

  • 11 sell something that could be described as a monitoring or observability solution

Until recently, monitoring businesses could be categorized by the type of operational data their software collects: logs (Splunk) or metrics and traces (APM companies). This categorization still kind of works in late 2019:

A new category of monitoring startups, however, is companies that offer some kind of observability pipeline, an event-driven workflow for filtering and routing operational data. Think of it as intelligent software glue for teams with dozens of tools and services that deal with operational data, visualization and alerts.

The architecture of these pipelines can be found on the Cribl, Sensu, and NexClipper websites. You also see a similar concept without an ops-specific focus in Amazon’s EventBridge or Azure’s Event Grid. There’s also Stripe’s open-source Veneur. Everybody visualizes it in a similar way—data in, magic, data out:

A generic version of it could look something like this:

Implementing a monitoring pipeline opens up some interesting possibilities. A major use case is simplifying operational concerns around compliance and security while giving individual engineering or ops teams autonomy to extend the pipeline for different workflows, dashboards, analytics, and alerting without breaking things for others.

For example, imagine an ops team wants to have the capability to detect anomalies that could indicate “something very bad is happening” using bleeding-edge machine learning (ML). With a pipeline defined, they just have to make a change to an existing configuration and logs for specific apps are automatically re-routed a third-party service that specializes in ML (like Zebrium, another Kubecon exhibitor). If the data needs to be sanitized before sending off to the third-party, no problem: the team just attaches a new rule that removes PII data. There’s a lot of flexibility, including picking and choosing specialized tools to visualize and understand data.

Cribl, founded by ex-Splunk employees, also points out on their site that there’s potential for cost savings: you can implement policies in the pipeline so you see the most important data in real-time or convert logs into cheaper metrics. It’s still early, but emerging standards (see OpenMetrics, OpenTelemetry) could make integrating more data into pipelines easier as well.

Event-driven pipelines are not a new idea at all, but as far as I can tell it’s new to see pipelines as a distinct and highly configurable feature bundled inside of monitoring solution. If you squint, it almost looks like the beginning of control and data planes for observability.

Kubecon startups not doing pipeline stuff are equally interesting:

  • Humio, fresh off a Thoughtworks Tech Radar mention, and Chronosphere, a few weeks post-series A. Both seem to be building their products/special sauce around custom time series databases.

  • Many founders of observability startups over the past few years can describe themselves as “ex-[large technology company]”. Flowmill is what happens when the founder is a  Computer Science PhD from MIT’s CSAIL Lab and is an expert on some pretty interesting and bleeding-edge Linux kernel technology (eBPF).

  • Rookout seems seems incredibly cool from a developer perspective: click on a line of your code and get live production logs or metrics. They call it non-breaking breakpoints, and the demo on their homepage is impressive.

There’s a lot of really interesting stuff going on and want to get into the details of BPF and why it’s so fascinating in a monitoring context in the next newsletter.

Happy Thanksgiving this week for people based in the US. This post on using a bakery for explaining technology jobs to family comes in handy this time of year. 

Thanks for subscribing, and if you enjoyed this feel free to share with the link below.

Newsletter disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Monitoring Monitoring

newsletter about the monitoring/observability startup ecosystem.

If you’re reading this, you might be interested in emerging technology companies that make products or solutions in the monitoring or observability space.

Sorry about that.

Monitoring Monitoring is a mullet of a newsletter that attempts to document the growing ecosystem of new software startups making products in this area with a split focus:

  1. Business in the front: Competitive landscape, product launches, pricing, etc.

  2. Party in the back: The technology that powers it all (custom timeseries databases written in Rust! W3C tracing standards! obscure Linux kernel stuff!)

Sign up now to get the first issue.

Obligatory social media sharing link.

Loading more posts…