Service Meshes, Kiali, and Continuous Verification

What happens when observability data meets configuration management and deploys?

This is Monitoring Monitoring, a quasi-monthly email newsletter about early-stage startups and projects in the observability space. Subscribe here.

There are good reminders that Kubernetes might not be the solution to all of our cloud software problems. In a recent tweet, Jaana Dogan compared Kubernetes to one of the deeper layers (L4) of the OSI model: container orchestration might be a foundational part of cloud native software, but it is also something that many developers and operations teams can avoid interacting with—if they pay someone else to deal with it.

So what’s a friendlier application-focused abstraction if you don’t want to get lost in the deep weeds of Kubernetes and you’re not on the serverless bandwagon? In early 2020, your answer might be some kind of service mesh.

A service mesh manages how microservices (macroservices?) communicate with each other, including what happens when new software gets deployed. There are good security and networking use cases, but the focus of this newsletter this week is what happens when the service mesh meets monitoring.

Service mesh-palooza

The open-source ecosystem around service meshes is thriving. Istio, linkerd, and Envoy have spawned a number of paid and free solutions. Both AWS (AWS App Mesh) and Azure (Azure Service Fabric Mesh) have based their managed solutions on Envoy. Google Cloud (Traffic Director) has a solution based on Istio. There is also a fascinating open-source project tailored for financial institutions called SOFAStack that seems to be powering (parts of) large banks in China.

In terms of enterprise SaaS meshes, there’s also Aspen Mesh (incubated by F5, based on Istio), Tetrate (based on Istio and Envoy), and Banzai Cloud’s Backyards.

When it comes to monitoring, the interesting aspect of service meshes is that they make observability data that was previously difficult to collect a built-in feature of running any kind of service on the mesh. Traditional metrics (like request rate) and traces are labeled with the context of the workloads generating each request, cool visualizations included.
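
For a concrete sense of what “labeled with the context of the workloads” means, here’s a minimal sketch of querying mesh-generated metrics from Prometheus. It assumes Istio’s standard istio_requests_total metric and a Prometheus server reachable at localhost:9090; adjust the metric names and address for your own setup.

```python
# A minimal sketch: querying mesh-generated metrics from Prometheus.
# Assumes Istio's standard `istio_requests_total` metric and a Prometheus
# server at localhost:9090 (both are assumptions about your setup).
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"

# Request rate per destination workload, labels added by the mesh automatically.
query = 'sum by (destination_workload) (rate(istio_requests_total[5m]))'

resp = requests.get(PROMETHEUS, params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    workload = series["metric"].get("destination_workload", "unknown")
    value = float(series["value"][1])
    print(f"{workload}: {value:.2f} req/s")
```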

Enter Kiali, a service-mesh-meets-observability project for Istio that seems to be backed by Red Hat/IBM. The interesting twist with Kiali is in the tagline: “Service mesh observability and configuration”.

Configuration meets observability

There are several compelling ideas in Kiali, including emerging methods for correlating observability data. Kiali—like most major APM/logging vendors—combines metrics and traces, and there are different technical approaches to linking “the three pillars of observability” in a single interface for easy troubleshooting. See this technical talk from Chronosphere cofounder Rob Skillington on deep-linking metrics and traces for how this is done.
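
The mechanics differ by vendor, but a common trick is attaching a sampled trace ID (an “exemplar”) to a metric series instead of exploding label cardinality. Here is a toy sketch of the idea—made-up names, no particular library, and certainly not Kiali’s implementation:

```python
# A toy illustration of "deep-linking" metrics and traces: alongside each
# counter, keep a sampled exemplar trace ID so a spike on a dashboard can
# link straight to one representative trace.
import random
import time
from collections import defaultdict

counters = defaultdict(int)   # (route, status) -> request count
exemplars = {}                # (route, status) -> (trace_id, timestamp)

def record_request(route, status, trace_id):
    key = (route, status)
    counters[key] += 1
    # Keep a sampled exemplar so high-cardinality trace IDs never become labels.
    if key not in exemplars or random.random() < 0.01:
        exemplars[key] = (trace_id, time.time())

record_request("/checkout", 500, trace_id="4bf92f3577b34da6")
print(counters[("/checkout", 500)], exemplars[("/checkout", 500)][0])
```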

Another concept in Kiali is how it uses its privileged position in your infrastructure—it sees all service-to-service communication in the mesh—to help teams manage complex configuration using two different approaches: validating existing Istio configuration (and flagging misconfigurations), and offering wizards that create or change routing rules directly from the UI.

The general idea is that if problems are detected, it’s possible to fix them immediately inside Kiali. Metrics are combined with awareness of runtime configuration, so you can answer questions and fix problems in the same tool, with a human in the loop. It’s a cool idea (certified dope by Kelsey Hightower), and integrating significant configuration changes directly inside an observability project is a genuinely new concept.

Deploys meet observability

Just as observability data helps diagnose problems caused by configuration, it has always been central to determining whether a deploy was successful. A popular solution is Spinnaker, Kubernetes-friendly continuous-deployment software that integrates with the usual monitoring vendors. Spinnaker is also available as enterprise-focused SaaS from Armory, OpsMx, and Mirantis.

Verica, founded by Casey Rosenthal, an author of the recently published Chaos Engineering book, extends this idea further with continuous verification (CV). CV is a method of proactively identifying issues in a complex system using an automated system that verifies assumptions about a service—effectively what happens when you fully automate well-designed chaos engineering experiments, as Netflix did with its Chaos Automation Platform, ChAP.

In a 2019 blog post and conference talk, Casey argues that CV techniques are the evolution of what teams have learned with continuous integration and delivery. 

Continuous verification features also seem to be publicly available from the continuous-delivery-as-a-service startup Harness, which achieves this via integrations with APM and logging vendors, with some machine learning sprinkled on top.
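
As a rough sketch (and not Verica’s or Harness’s actual approach), a minimal verification step might query an assumption from Prometheus after a deploy and fail the pipeline when it no longer holds. The metric name and threshold below are placeholders:

```python
# A minimal sketch of a post-deploy verification step, assuming a Prometheus
# server and a generic `http_requests_total` metric (both assumptions).
import sys
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"
ERROR_BUDGET = 0.01  # assume the service promises < 1% errors

query = (
    'sum(rate(http_requests_total{status=~"5.."}[10m]))'
    ' / sum(rate(http_requests_total[10m]))'
)

result = requests.get(PROMETHEUS, params={"query": query}).json()["data"]["result"]
error_ratio = float(result[0]["value"][1]) if result else 0.0

if error_ratio > ERROR_BUDGET:
    print(f"verification failed: error ratio {error_ratio:.4f} exceeds budget")
    sys.exit(1)  # fail the pipeline, roll back the deploy
print(f"verification passed: error ratio {error_ratio:.4f}")
```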

The promising future of using observability data in new and clever ways

There’s a lot of attention on startups focused on the core problems of observability right now—different technical approaches that process, store, instrument, collect, and visualize data. Many of these emerging techniques have been covered in this newsletter, from eBPF and Prometheus databases to observability pipelines.

However, some of the most compelling startups and projects right now are exploring what happens when you use observability data in clever ways to solve problems facing technical teams as they ship software—even if they decide they don’t need to deal with Kubernetes. New ways to safely deploy software, proactively identify issues in a complex system, or detect configuration errors are just a few possibilities. There will be many more.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Coding and Tracing Workflow Remix (feat. Dark)

Startups and projects blurring the boundaries between development and monitoring

This is Monitoring Monitoring, a quasi-monthly newsletter about early-stage startups in the observability space. Subscribe here.

The issue this month is about startups and projects that are blurring the boundaries between development and monitoring—from debugging Kubernetes clusters (Squash and Telepresence) and instrumentation controlled by cloud-based code breakpoints (Rookout) to a new programming language and infrastructure (Dark).

Software engineers have different toolchains and workflows for coding, testing and monitoring. These companies and projects are bringing these workflows closer together (or reinventing them) to make building and operating modern backend services easier.

Kubernetes Debugging with Clever Proxies

The Kubernetes meme ecosystem is exploding, but the classics involve developers reacting to getting started with Kubernetes. One particular pain point for devs is that traditional debugging—setting a breakpoint and tracing execution through lines of code in a favorite IDE—becomes impossible with complex cloud-based services that can’t run on your laptop.

Two open-source projects, Squash (created by solo.io) and Telepresence (created by Datawire), use special proxies that hook into remote clusters to enable debugging with the usual tools and IDEs. It’s a developer-specific workflow complementary to monitoring, where, in the words of Squash’s documentation, the feedback loop might be too slow:

Certain tools exist for troubleshooting microservice issues. OpenTracing can be used to produce transaction or workflow logs for post-mortem analysis. Service meshes like Istio can be used to monitor the network to identify latency problems. Unfortunately, these tools are passive, the feedback loop is slow, and they do not allow you to monitor and alter the application during run time.

There are some clever networking tricks that make this kind of remote debugging possible, including a novel use of Envoy proxy to filter special debugging requests in Squash or various VPN tunneling and shared library hacks in Telepresence. Telepresence also has some additional features like live coding a service (that is hopefully not running in production).

If you blended some of the debugging use cases of Squash and Telepresence with a modern APM solution and cloud IDE, you might get something like Rookout (previously mentioned in the Kubecon 2019 recap).

The ‘Responsive Code-Data Layer’ with Rookout

As Squash’s about page says, most monitoring solutions are passive: if you need new information, you must deploy a change (code, agents, plugins, config files, or libraries) and get back additional telemetry. There are some emerging techniques to instrument code (eBPF, covered previously), but gathering new data generally requires a new deployment.

Rookout offers an alternative without requiring a deploy to get new logs or metrics: if you want to measure something new, just click a line of your code in their cloud-based user interface. They call it the “responsive code-data layer” on their marketing page.

There is an interactive sandbox available for testing, but the idea is immediately clear to developers: an IDE-like environment where you set breakpoints to collect new data and control observability in real time.
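
To make the “non-breaking breakpoint” idea concrete, here’s a toy illustration in plain Python—capture local variables at a given line without pausing the program. It’s only a sketch of the concept, not how Rookout is implemented:

```python
# Toy "non-breaking breakpoint": snapshot locals at a target file/line
# while the program keeps running.
import sys

captured = []

def make_tracer(target_file, target_line):
    def global_trace(frame, event, arg):
        # Only trace frames from the file we care about.
        if event == "call" and frame.f_code.co_filename == target_file:
            return local_trace
        return None

    def local_trace(frame, event, arg):
        if event == "line" and frame.f_lineno == target_line:
            captured.append(dict(frame.f_locals))  # snapshot, don't stop
        return local_trace

    return global_trace

def handler(user_id):
    total = user_id * 2
    return total

if __name__ == "__main__":
    target = handler.__code__.co_firstlineno + 2  # the `return total` line
    sys.settrace(make_tracer(__file__, target))
    handler(21)
    sys.settrace(None)
    print(captured)  # [{'user_id': 21, 'total': 42}]
```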

This kind of product opens up several new use cases. One of my favorites is called “sustainable logging”, where you can reduce the volume of your log messages (and your Splunk bill).

If you combine observability, an IDE, a new programming language, a workflow, and managed infrastructure, you might end up with Dark, a new backend-as-a-service startup.

Trace-Driven Development with Dark

The most compelling phrase in Dark’s private beta documentation might be trace-driven development: you “send requests to Dark before writing code.” Trace-driven development isn’t a new idea, but it hasn’t broken into the mainstream, either. (Ted Young of Lightstep had a great talk on it in 2018.)

Ted’s talk and an obscure Erlang message board post from 2012 echo an idea that Dark seems to be exploring in their product: how can traces result in a faster development and testing feedback loop and higher quality, working software? In Dark, a developer is constantly using traces that start with end-user requests to develop and integrate code.

Consider a simple app in the Dark IDE (named “the canvas”). Traces are the clickable dots to the left of the HTTP handler, and each end-user request is represented by a different dot that can be selected and replayed.

Want to change the code to handle a new query parameter named “foo”? First, make an HTTP request that contains the new parameter, then start coding to handle it. As you type, everything updates automatically without a deploy.

Dark seems to be the first kind of development environment that puts end-user traces front and center. It also pretty much explodes every other convention around backend service development.

Code-centric Monitoring

Despite the perception that monitoring solutions are only used by developers when something goes wrong (see last issue), the feedback loop from collected telemetry to code changes has always been slow. These projects and companies offer different solutions designed around tracing code execution to improve this:

  • Connect local debugging workflows and tools to distributed systems (Squash, Telepresence) 

  • Integrate an IDE-like breakpoint experience with an observability solution (Rookout)

  • Redefine the entire backend development workflow and toolchain (Dark)

All of the above also suggest a more code-centric and active approach to monitoring—the central user interface is interacting with your own code, not a dashboard or query interface. For backend developers, new approaches that offer relief from the status quo seem welcome.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Chaos, Complexity and Catalogs

Startups helping operate and understand complex systems through chaos engineering or service catalogs

Welcome to all the new subscribers who like databases. This is Monitoring Monitoring, a quasi-monthly newsletter about startups in the observability space. You can subscribe here.

Monitoring is most appreciated when something breaks. Despite a lot of marketing (and snake oil) around AI, it’s still up to human operators to find a fix.

The lessons from a physician’s cult-favorite resilience engineering paper from the nineties still resonate with people building software today. Consider these two points:

12. Human practitioners are the adaptable element of complex systems. 

18. Failure free operations require experience with failure.

The messy intersection of people, organizations, and complex (always failing) software is the focus of this newsletter. If we use monitoring, observability and dashboards to answer the question “what’s broken?”, here are two more questions:  

How do we break it? (#18) and Who owns it? (#12) 

Chaos as a Service, or “How do we break it?”

In 2011, Netflix had the idea of testing a system’s ability to tolerate failures by deliberately injecting them with a tool called Chaos Monkey. This was extremely influential: nine years later, resilience and chaos engineering make up about half of the sessions at SREcon in March, and there’s now the equivalent of an Agile Manifesto for Chaos.

Gremlin (founded 2016) and ChaosIQ (founded 2017) are selling SaaS solutions to run chaos experiments and there is a long list of open-source tools in the space. My favorite project name is Bloomberg’s PowerfulSeal.

Gremlin and ChaosIQ have several integrations to run chaos experiments in different parts of the stack (like the network, cloud provider, or application). The idea, like Chaos Monkey, is you can deliberately create failure scenarios and learn what happened to build more resilient systems.
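
For a sense of how small these experiments can start, here’s a toy pod-killer in the spirit of Chaos Monkey. The namespace is hypothetical, and the commercial tools add the safety rails, scheduling, and hypothesis-tracking around this:

```python
# A toy Chaos-Monkey-style experiment (not Gremlin's or ChaosIQ's product):
# delete one random pod in a namespace and rely on Kubernetes to recover.
# Assumes kubectl is configured against a cluster you are allowed to break.
import json
import random
import subprocess

NAMESPACE = "staging"  # hypothetical; never start in production

def list_pods(namespace):
    """Return the names of all pods in a namespace via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, check=True, text=True,
    ).stdout
    return [item["metadata"]["name"] for item in json.loads(out)["items"]]

def kill_random_pod(namespace):
    """Delete one random pod and let Kubernetes reschedule it."""
    victim = random.choice(list_pods(namespace))
    subprocess.run(["kubectl", "delete", "pod", victim, "-n", namespace], check=True)
    return victim

if __name__ == "__main__":
    print(f"deleted pod: {kill_random_pod(NAMESPACE)}")
    # Now watch your dashboards and verify the system behaved as you hypothesized.
```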

There are a couple of different ways chaos solutions fit into monitoring (other than using logs and dashboards to figure out what broke during an experiment). Gremlin has support to overlay chaos events on top of Datadog dashboards, while ChaosIQ has integrations for Prometheus and OpenTracing. The OpenTracing integration gets at a cool idea where running an experiment changes the observability of your system: when you start the experiment, detailed traces can be collected automatically.

Before you run chaos experiments, however, it’s a good idea to know what your services actually do and who owns them.

Actually Your Problem as a Service, or “Who owns it?”

There’s a very human problem that emerges when an organization enters the late-stage microservice phase: it’s no longer clear from dashboards or internal documentation who owns a service or what a given service does—especially if the original owner was promoted (or left the company). Service naming can complicate understanding (see AWS Systems Manager Session Manager), and service metadata, in any case, is usually spread across clusters, data centers, and dashboards. There’s probably something clever to say about Conway’s Law here.

Enter OpsLevel, effx, Dive and GitCatalog. The key words on their marketing pages are “owned”, “tame”, “structured”, and “up-to-date”. Most provide integrations into different types of systems (clusters, CI/CD, alerting, messaging) so the service catalog updates automatically and doesn’t get stale on an outdated internal wiki page. According to Joey Parsons, founder of effx, the magic number where you start to feel human-to-services pain is around 20.
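
The product details differ, but the record a catalog keeps current looks roughly like this minimal sketch; field names are hypothetical, and the real products populate them automatically from clusters, CI/CD, alerting, and chat integrations:

```python
# A minimal sketch of a service catalog entry (hypothetical field names).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServiceEntry:
    name: str
    owner_team: str
    on_call_slack_channel: str
    tier: int                        # e.g. 1 = customer-facing, 3 = internal batch
    runbook_url: str
    dashboards: List[str] = field(default_factory=list)
    last_deployed_by: Optional[str] = None  # filled in by a CI/CD integration

checkout = ServiceEntry(
    name="checkout-api",
    owner_team="payments",
    on_call_slack_channel="#payments-oncall",
    tier=1,
    runbook_url="https://wiki.example.com/runbooks/checkout-api",
)
```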

Detailed metrics and logs might be available for all your services, but these startups remind us that that doesn’t matter much if you can’t find the right person or documentation to interpret the data. 

More systems, more human problems

The work these startups are doing is a humbling reminder that regardless of really interesting emerging technology in the observability space, there’s still a massive amount of work to be done to connect the right people with systems and help them deal with (and anticipate) failure scenarios. 

The intersection of academics, developers, SREs and operators tackling these problems is one of the most interesting interdisciplinary groups thinking about what happens when humans meet software complexity. There is a very long list of hard problems related to understanding and operating the systems we build: “who owns it?” and “how do we break it?” are just two that involve a fast-growing number of companies, consultancies, and engineers.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

See Also

Big Prometheus

Thanos, Cortex, M3DB and VictoriaMetrics at scale

A recent paper from Google that will probably have an impact on the future of large-scale data processing summarizes the situation facing many companies in early 2020:

“Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in a complex, expensive, and harder to maintain infrastructure.”

There are a number of interesting projects and startups addressing this problem in the open-source monitoring world, particularly for teams with dozens or hundreds of Kubernetes clusters using Prometheus for monitoring.

Prometheus, usually combined with Grafana for dashboards and reporting, has become the de facto standard for open-source monitoring in Kubernetes and one of the Cloud Native Computing Foundation’s most popular projects. (The Apache Foundation also has an open-source APM project called Skywalking that I’m told is big in China.)

A common configuration is to run a single Prometheus server per Kubernetes cluster, which is getting easier to set up and run with projects like Prometheus Operator. With dozens to hundreds of clusters in an organization—Cloudflare was running 188 in 2017—it’s complicated to maintain and slow to globally query so many Prometheuses (Prometheii?).

There are four major open-source database projects written in Go working on this problem: Thanos, Cortex, M3DB, and VictoriaMetrics. Their goals are similar:

  • Prometheus compatibility (integrate easily with existing Kubernetes clusters)

  • Global, fast queries (see all the data from everywhere quickly)

  • Long-term historical metrics (policies to store long-term metrics cheaply)

  • High availability (resiliency against crashes without data loss)

Here’s a brief overview:

Thanos and Cortex

Thanos and Cortex are similar enough in terms of the problems they aim to solve that there was a talk at PromCon 2019 (the Prometheus conference) from Tom Wilkie with the delightful title “Two Households, Both Alike in Dignity: Cortex and Thanos”.

Cortex behaves like typical SaaS monitoring, where data is pushed to a central location from remote servers (but using native Prometheus APIs). Thanos is less centralized, and data remains within each Prometheus server—which is how Prometheus operates by default. These two models lead to different technical approaches for answering global queries like “which of my clusters is on fire?”. Tom Wilkie’s talk has more details.
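
Either way, the payoff is one global endpoint that speaks the standard Prometheus HTTP API across every cluster. A minimal sketch of the “which cluster is on fire?” query, assuming a hypothetical global query address and a cluster external label added by each Prometheus or tenant:

```python
# A minimal sketch of a global query against a Thanos Querier / Cortex query
# frontend (or similar). The endpoint and metric names are assumptions.
import requests

GLOBAL_QUERY = "http://global-query.example.com/api/v1/query"  # hypothetical

# 5xx rate per cluster, assuming a `cluster` external label on all series.
query = 'sum by (cluster) (rate(http_requests_total{status=~"5.."}[5m]))'

resp = requests.get(GLOBAL_QUERY, params={"query": query})
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("cluster", "unknown"), series["value"][1])
```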

The other interesting architectural aspect is that both projects can leverage cloud-managed services to store long-term historical data (“what was the error rate seven weeks ago?”) in order to lower operational cost.

Thanos can use cheap-ish object-stores like Amazon S3 or Azure Blob Storage, while Cortex prefers more expensive NoSQL stores like AWS DynamoDB or Google BigTable. (Cortex is now adding support for object-stores as well.) This lets teams adopting either project make deliberate choices around cost and performance for historical metrics.

It seems like there will be more collaboration between the projects in the future, and there’s a juicy twitter hot take on it.

M3DB

Uber wrote their own thing, too

M3DB follows a Thanos-like model, but according to a post from the creator on Hacker News, the main issue in adopting Thanos was that Uber frequently needed historical metrics that were too slow to fetch from an object store like Amazon S3, and there were massive bandwidth costs when historical data moved between the cloud and Uber’s on-prem data centers.

There’s now a startup called Chronosphere that has spun off from this work, and they’ve raised an $11 million Series A round from Greylock Partners. According to TechCrunch, it will take over management of the M3 project going forward.

VictoriaMetrics

Like M3DB, Thanos, and Cortex, the creators of VictoriaMetrics also faced a cost and performance problem. They were inspired by ideas from ClickHouse, a cool-kid database created by Yandex that people publish extremely impressive performance and cost numbers for when they run it on three-year-old Intel hardware.

The VictoriaMetrics team, according to their FAQ, is prioritizing low operational cost and good developer/operator experience. There’s even a way to run it in a single node configuration. The performance numbers look impressive, and it seems like they also sell (or plan to sell) some kind of managed cloud service.

Coming soon to a startup near you?

At this point, you might be asking, “why bother? I’d rather pay [SaaS vendor] and not deal with any of this.”

The interesting thing is that [SaaS vendor] may very well be using one of these projects to power the observability solution they’re selling. Cortex, notably, is being used by Weave Cloud, Grafana Cloud, and a new enterprise service mesh called Aspen Mesh. Thanos was developed at (and is presumably still being used by) the gaming software company Improbable, and M3DB, of course, is being used at Uber.

Another exciting thing about these projects is that they offer a front-row seat to talented engineers solving hard distributed systems problems related to large-scale monitoring and making different decisions along the way. All this activity suggests a healthy open-source ecosystem that’s supporting a number of observability startups and providing solutions for other companies that need to run some kind of Big Prometheus. There are more high-quality open-source choices than ever.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Addition (since there are no comments on Substack): People in the know say you should think very, very carefully before deciding you need to run one of these solutions.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

eBPF: A new BFF for Observability Startups

The newsletter this week continues looking at interesting products monitoring startups are building. After a shallow dive into observability pipelines, let’s consider eBPF: the most technical and nerdy topic in the entire monitoring space right now.

eBPF, short for extended Berkeley Packet Filter, is a Linux subsystem that has seen significant adoption in systems performance engineering and software-defined networking tools. Netflix’s Brendan Gregg, who just published a book about eBPF, has called it a “Linux superpower”.

For a small number of startups with very technical founders, eBPF is being used for flow monitoring, a type of network traffic analysis. Network data collected using eBPF, particularly when combined with metadata from Kubernetes or service meshes, opens up some interesting use cases and hints at bigger things to come—maybe.

eBPF vs Traditional Instrumentation 

If you’re not a Linux kernel hacker, there’s something fundamental about eBPF you should understand, and that’s related to how eBPF collects observability data.  

If you were going to build some kind of application monitoring solution in the past two decades, the general approach has been to write a special-sauce library (“a monitoring agent”) that runs alongside your code and instruments different events to generate metrics. It’s hard to do well because of a massive number of edge cases (see here), but at a high level it looks something like this:
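
Roughly, a minimal sketch of the library approach—instrumentation runs in-process and wraps whatever it knows how to wrap (the names here are made up):

```python
# A highly simplified sketch of the library/agent approach: instrumentation
# lives in the same process as your code and wraps known entry points.
import time
from collections import defaultdict
from functools import wraps

METRICS = defaultdict(list)  # metric name -> observed latencies in seconds

def instrument(name):
    """Wrap a function and record how long each call takes."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                METRICS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@instrument("checkout.handle_request")
def handle_request(order_id):
    time.sleep(0.01)  # stand-in for real work
    return order_id

handle_request(42)
print(METRICS["checkout.handle_request"])
```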

The eBPF approach is fundamentally different: Linux engineers write programs that run safely inside the kernel to hook into and measure nearly any kind of system event. Networking-related events, like connections opening or closing, are especially well-supported. Highly (highly!) simplified, it looks like this:
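
Here’s a highly simplified sketch in the spirit of bcc’s example tools (not any vendor’s agent): count new outbound TCP connections per process by attaching a small program to the kernel’s tcp_v4_connect. It assumes root, a reasonably recent kernel, and bcc installed:

```python
# Count outbound IPv4 TCP connects per process using an eBPF kprobe,
# with no application libraries or code changes involved.
from time import sleep
from bcc import BPF  # requires the bcc toolkit and root privileges

program = r"""
BPF_HASH(connect_count, u32, u64);

int count_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    connect_count.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
# Attach to the kernel function that handles outbound IPv4 TCP connects.
b.attach_kprobe(event="tcp_v4_connect", fn_name="count_connect")

print("Counting outbound TCP connects per process for 10 seconds...")
sleep(10)

for pid, count in b["connect_count"].items():
    print(f"pid {pid.value}: {count.value} connect(s)")
```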

With eBPF, interesting things that happen in code or in the kernel that you want to measure—like how many connections are made to an IP address—can now be captured at a lower level without special application libraries or code changes. If well-written, eBPF programs are also very performant. Facebook, apparently, now has around 40 eBPF programs running on every server.

As of late 2019, six or seven startups seem to be using eBPF for some kind of observability offering (and discussing it publicly): Flowmill, Sysdig, Weaveworks’ Weave Scope, ntop’s ntopng, Cilium’s Hubble, Datadog, and Instana*. There is also an open-source Prometheus exporter from Cloudflare.

The main thing all the paid products have in common is that they seem to be good at building detailed network flow maps and capturing related metrics.

Since eBPF works on any new-ish kernel version, there is also a possibility of building a more complete map of the things that do things in your cloud/datacenter, not just the latest stuff you have running in Kubernetes or services that you hand-instrumented in Rust.

Flowmill’s CEO Jonathan Perry described three broad use cases for this kind of monitoring during a recent talk:

  1. Software architecture (how everything fits together with neat visualizations)

  2. Service health (figuring out what’s not working via networking metrics)

  3. Data transfer cost optimization (see Corey Quinn’s tweet on why this is so painful in the cloud)

It’s a solid list and expands on well-known eBPF security and performance tuning use-cases.

What’s next?

All of the work done in eBPF so far opens up an interesting question: eventually, will it replace most forms of agent-based instrumentation with magic bytecode that runs in the kernel and understands everything your system, applications, network and services are doing at an extremely low level that results in amazing visualizations and operational insights at hyper-cloud scale?

Not yet. But people are probably working on it. To nerd out on how it might happen, take a look at user-level dynamic tracing (uprobes) or user statically-defined tracing (USDT). Both are methods to hook into application-level events. There are some really interesting blog posts, examples, and cool proof-of-concept code, like a BPF program that spies on everything you type into the shell.

There’s still a lot of superhero-level systems engineering that needs to be done. Until someone figures that out, eBPF solutions are exciting to watch for security use cases and network or flow monitoring—especially if you’re experimenting with service meshes or running lots of clusters.

That’s it for this newsletter. If you enjoyed reading, feel free to share using the link below or subscribe.

*Correction: The email edition of this newsletter left out Instana and Datadog. Instana seems to be using some eBPF functionality in technical preview according to their docs, and Datadog uses a fork of Weaveworks’ TCP tracer in their month-old Network monitoring solution.

Newsletter disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Appendix: Deep Dives on eBPF Monitoring from different startup perspectives
