Coding and Tracing Workflow Remix (feat. Dark)

Startups and projects blurring the boundaries between development and monitoring

This is Monitoring Monitoring, a quasi-monthly newsletter about early-stage startups in the observability space. Subscribe here.

The issue this month is about startups and projects that are blurring the boundaries between development and monitoring—from debugging Kubernetes clusters (Squash and Telepresence) and instrumentation controlled by cloud-based code breakpoints (Rookout) to a new programming language and infrastructure (Dark).

Software engineers have different toolchains and workflows for coding, testing and monitoring. These companies and projects are bringing these workflows closer together (or reinventing them) to make building and operating modern backend services easier.

Kubernetes Debugging with Clever Proxies

The Kubernetes meme ecosystem is exploding, but the classics involve developers reacting to getting started with Kubernetes. One particular pain point for devs is that traditional debugging—setting a breakpoint and tracing execution through lines of code in a favorite IDE—becomes impossible with complex cloud-based services that can’t run on your laptop.

Two open-source projects, Squash (created by solo.io) and Telepresence (created by Datawire), use special proxies that hook into remote clusters to enable debugging with the usual tools and IDEs. It’s a developer-specific workflow complementary to monitoring, where, in the words of Squash’s documentation, the feedback loop might be too slow:

Certain tools exist for troubleshooting microservice issues. OpenTracing can be used to produce transaction or workflow logs for post-mortem analysis. Service meshes like Istio can be used to monitor the network to identify latency problems. Unfortunately, these tools are passive, the feedback loop is slow, and they do not allow you to monitor and alter the application during run time.

There are some clever networking tricks that make this kind of remote debugging possible, including a novel use of Envoy proxy to filter special debugging requests in Squash and various VPN tunneling and shared-library hacks in Telepresence. Telepresence has additional features like live-coding a service (that is hopefully not running in production).

If you blended some of the debugging use cases of Squash and Telepresence with a modern APM solution and cloud IDE, you might get something like Rookout (previously mentioned in the Kubecon 2019 recap).

The ‘Responsive Code-Data Layer’ with Rookout

As Squash’s about page says, most monitoring solutions are passive: if you need new information, you must deploy a change (code, agents, plugins, config files, or libraries) to get back additional telemetry. There are some emerging techniques to instrument code (eBPF, covered previously), but gathering new data generally requires a new deployment.

Rookout offers an alternative that doesn’t require a deploy to get new logs or metrics: if you want to measure something new, just click a line of your code in their cloud-based user interface. They call it the “responsive code-data layer” on their marketing page.

There is an interactive sandbox available for testing, but visually the idea is clear to developers—it’s an IDE-like environment where you set breakpoints to collect new data and control observability in real-time:

This kind of product opens up several new use cases. One of my favorites is called “sustainable logging”, where you can reduce the volume of your log messages (and, by extension, your Splunk bill).
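
To get a rough feel for what a “non-breaking breakpoint” means, here’s a minimal Python sketch of the concept (this is emphatically not Rookout’s implementation, and the file name and line number are made up): watch for a specific line being executed and snapshot its local variables without ever pausing the program.

```python
import sys

def non_breaking_breakpoint(filename, lineno, collect):
    """Snapshot local variables whenever filename:lineno executes, without pausing."""
    def tracer(frame, event, arg):
        if (event == "line" and frame.f_lineno == lineno
                and frame.f_code.co_filename.endswith(filename)):
            collect(dict(frame.f_locals))  # copy the locals; don't touch the live frame
        return tracer
    sys.settrace(tracer)

snapshots = []
non_breaking_breakpoint("app.py", 42, snapshots.append)  # hypothetical target
# The program keeps running; `snapshots` fills with data you never explicitly logged.
# (Real products do this with far less overhead than sys.settrace.)
```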

If you combine observability, an IDE, a new programming language and workflow, and managed infrastructure, you might end up with Dark, a new backend-as-a-service startup.

Trace-Driven Development with Dark

The most compelling phrase in Dark's private beta documentation might be trace-driven development: you “send requests to Dark before writing code.” Trace-driven development isn’t a new idea, but it hasn’t seemed to break into the mainstream, either. (Ted Young of Lightstep had a great talk on it in 2018.)

Ted’s talk and an obscure Erlang message board post from 2012 echo an idea that Dark seems to be exploring in their product: how can traces result in a faster development and testing feedback loop and higher quality, working software? In Dark, a developer is constantly using traces that start with end-user requests to develop and integrate code.

Consider the example below, which is a screenshot of a simple app in the Dark IDE (named “the canvas”). Traces are the clickable dots to the left of the HTTP handler and each end-user request is represented by a different dot that can be selected and replayed:

Want to change the code to handle a new kind of query parameter named “foo”? First, make an HTTP request that contains that new parameter, then start coding to handle it. As you type, everything automatically updates without a deploy.
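
In workflow terms, that looks something like the snippet below (the canvas URL and endpoint are hypothetical): fire the request first, then write the handler against the trace it leaves behind.

```python
import requests

# Hypothetical Dark canvas endpoint: send the request *before* the code exists
resp = requests.get(
    "https://my-canvas.builtwithdark.com/search",
    params={"foo": "bar"},  # the new query parameter we haven't handled yet
)
print(resp.status_code, resp.text)

# Then, in the canvas, click the trace this request produced and write the
# handler code against the live values captured in that trace.
```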

Dark seems to be the first development environment of its kind to put end-user traces front and center. It also pretty much explodes every other convention around backend service development.

Code-centric Monitoring

Monitoring solutions are often perceived as something developers only reach for when something goes wrong (see last issue), and the feedback loop from collected telemetry back to code changes has always been slow. These projects and companies offer different solutions designed around tracing code execution to shorten that loop:

  • Connect local debugging workflows and tools to distributed systems (Squash, Telepresence) 

  • Integrate an IDE-like breakpoint experience with an observability solution (Rookout)

  • Redefine the entire backend development workflow and toolchain (Dark)

All of the above also suggest a more code-centric and active approach to monitoring—the central user interface is interacting with your own code, not a dashboard or query interface. For backend developers, new approaches that offer relief from the status quo seem welcome.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Chaos, Complexity and Catalogs

Startups helping operate and understand complex systems through chaos engineering or service catalogs

Welcome to all the new subscribers who like databases. This is Monitoring Monitoring, a quasi-monthly newsletter about startups in the observability space. You can subscribe here.

Monitoring is most appreciated when something breaks. Despite a lot of marketing (and snake oil) around AI, it’s still up to human operators to find a fix.

The lessons from a physician’s cult-favorite resilience engineering paper from the nineties still resonate with people building software today. Consider these two points:

12. Human practitioners are the adaptable element of complex systems. 

18. Failure free operations require experience with failure.

The messy intersection of people, organizations, and complex (always failing) software is the focus of this newsletter. If we use monitoring, observability and dashboards to answer the question “what’s broken?”, here are two more questions:  

How do we break it? (#18) and Who owns it? (#12) 

Chaos as a Service, or “How do we break it?”

In 2011, Netflix had the idea of testing a system’s ability to tolerate failure by deliberately injecting faults with a tool called Chaos Monkey. It was extremely influential: nine years later, resilience and chaos engineering make up about half of the sessions at SREcon in March, and there’s now the equivalent of an Agile Manifesto for Chaos.

Gremlin (founded 2016) and ChaosIQ (founded 2017) are selling SaaS solutions to run chaos experiments and there is a long list of open-source tools in the space. My favorite project name is Bloomberg’s PowerfulSeal.

Gremlin and ChaosIQ have several integrations to run chaos experiments in different parts of the stack (like the network, cloud provider, or application). The idea, like Chaos Monkey’s, is that you deliberately create failure scenarios and learn from what happens in order to build more resilient systems.

There are a couple of different ways chaos solutions fit into monitoring (other than using logs and dashboards to figure out what broke during an experiment). Gremlin has support to overlay chaos events on top of Datadog dashboards, while ChaosIQ has integrations for Prometheus and OpenTracing. The OpenTracing integration gets at a cool idea where running an experiment changes the observability of your system: when you start the experiment, detailed traces can be automatically collected.
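
Here’s a hedged sketch of that idea using the OpenTracing Python API (this is not ChaosIQ’s actual integration, just the shape of it): wrap a crude fault-injection experiment in a span so anything traced while the fault is active can be tied back to the experiment.

```python
import random
import time

import opentracing  # assumes a concrete tracer is registered; defaults to a no-op tracer

def inject_latency(max_seconds=2.0):
    """Crude fault injection: sleep a random amount to simulate a slow dependency."""
    time.sleep(random.uniform(0, max_seconds))

tracer = opentracing.global_tracer()
with tracer.start_active_span("chaos-experiment") as scope:
    scope.span.set_tag("experiment", "latency-injection")
    scope.span.set_tag("target", "checkout-service")  # hypothetical target service
    inject_latency()
    # ...run steady-state checks here; detailed traces collected while this span
    # is active carry the experiment's tags for later analysis.
```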

Before you run chaos experiments, however, it’s a good idea to know what your services actually do and who owns them.

Actually Your Problem as a Service, or “Who owns it?”

There’s a very human problem that shows up when an organization enters the late-stage microservice phase: it’s no longer clear from dashboards or internal documentation who owns a service or what a given service does—especially if the original owner was promoted (or left the company). Service naming can complicate understanding (see AWS Systems Manager Session Manager), and service metadata, in any case, is usually spread across clusters, data centers, and dashboards. There’s probably something clever to say about Conway’s Law here.

Enter OpsLevel, effx, Dive and GitCatalog. The key words on their marketing pages are “owned”, “tame”, “structured”, and “up-to-date”. Most provide integrations with different types of systems (clusters, CI/CD, alerting, messaging) so the service catalog updates automatically and doesn’t go stale on an outdated internal wiki page. According to Joey Parsons, founder of effx, the magic number where you start to feel human-to-services pain is around 20 services.
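
The underlying data model isn’t the hard part; keeping it accurate is. A hypothetical minimal catalog entry (the field names are mine, not any vendor’s schema) might look like:

```python
from dataclasses import dataclass, field

@dataclass
class ServiceEntry:
    """One record in a service catalog; the fields are illustrative."""
    name: str                       # e.g. "payments-api"
    owner_team: str                 # who gets paged
    slack_channel: str              # where to ask "what does this do?"
    tier: int                       # 1 = customer-facing, 3 = internal batch
    runbook_url: str = ""
    dashboards: list = field(default_factory=list)
    last_verified: str = ""         # without integrations, this goes stale fast
```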

Detailed metrics and logs might be available for all your services, but these startups remind us that none of it matters much if you can’t find the right person or documentation to interpret the data.

More systems, more human problems

The work these startups are doing is a humbling reminder that, for all the really interesting emerging technology in the observability space, there’s still a massive amount of work to be done to connect the right people with systems and help them deal with (and anticipate) failure scenarios.

The intersection of academics, developers, SREs and operators tackling these problems is one of the most interesting interdisciplinary groups thinking about what happens when humans meet software complexity. There is a very long list of hard problems related to understanding and operating the systems we build: “who owns it?” and “how do we break it?” are just two that involve a fast-growing number of companies, consultancies, and engineers.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

See Also

Big Prometheus

Thanos, Cortex, M3DB and VictoriaMetrics at scale

A recent paper from Google that will probably have an impact on the future of large-scale data processing summarizes the situation facing many companies in early 2020:

“Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in a complex, expensive, and harder to maintain infrastructure.”

There are a number of interesting projects and startups addressing this problem in the open-source monitoring world, particularly for teams with dozens or hundreds of Kubernetes clusters using Prometheus for monitoring.

Prometheus, usually combined with Grafana for dashboards and reporting, has become the de facto standard for open-source monitoring in Kubernetes and one of the Cloud Native Computing Foundation’s most popular projects. (The Apache Software Foundation also has an open-source APM project called SkyWalking that I’m told is big in China.)

A common configuration is to run a single Prometheus server per Kubernetes cluster, which is getting easier to set up and run with projects like Prometheus Operator. With dozens to hundreds of clusters in an organization—Cloudflare was running 188 in 2017—maintaining and globally querying that many Prometheuses (Prometheii?) becomes complicated and slow.
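
Without a global query layer, “query everything” means fanning the same PromQL out to each cluster’s Prometheus over its HTTP API and stitching the results together yourself; the sketch below (cluster names and URLs are made up) is roughly the kind of glue these projects exist to replace.

```python
import requests

# One Prometheus per cluster; addresses are illustrative
PROMETHEUS_SERVERS = {
    "us-east-1": "http://prometheus.us-east-1.example.internal:9090",
    "eu-west-1": "http://prometheus.eu-west-1.example.internal:9090",
}

def global_query(promql):
    """Fan a PromQL instant query out to every cluster's /api/v1/query endpoint."""
    results = {}
    for cluster, base_url in PROMETHEUS_SERVERS.items():
        resp = requests.get(f"{base_url}/api/v1/query",
                            params={"query": promql}, timeout=5)
        resp.raise_for_status()
        results[cluster] = resp.json()["data"]["result"]
    return results

# e.g. the 5xx rate, one result set per cluster
print(global_query('sum(rate(http_requests_total{code=~"5.."}[5m]))'))
```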

There are four major open-source database projects written in Go working on this problem: Thanos, Cortex, M3DB, and VictoriaMetrics. Their goals are similar:

  • Prometheus compatibility (integrate easily with existing Kubernetes clusters)

  • Global, fast queries (see all the data from everywhere quickly)

  • Long-term historical metrics (policies to store long-term metrics cheaply)

  • High availability (resiliency against crashes without data loss)

Here’s a brief overview:

Thanos and Cortex

Thanos and Cortex are similar enough in terms of the problems they aim to solve that there was a talk at PromCon 2019 (the Prometheus conference) from Tom Wilkie with the delightful title “Two Households, Both Alike in Dignity: Cortex and Thanos”.

Cortex behaves like typical SaaS monitoring, where data is pushed to a central location from remote servers (but using native Prometheus APIs). Thanos is less centralized: data remains within each Prometheus server, which is how Prometheus operates by default. These two models lead to different technical approaches for answering global queries like “which of my clusters is on fire?”. Tom Wilkie’s talk has more details.

The other interesting architectural aspect is that both projects can leverage cloud-managed services to store long-term historical data (“what was the error rate seven weeks ago?”) in order to lower operational cost.

Thanos can use cheap-ish object-stores like Amazon S3 or Azure Blob Storage, while Cortex prefers more expensive NoSQL stores like AWS DynamoDB or Google BigTable. (Cortex is now adding support for object-stores as well.) This lets teams adopting either project make deliberate choices around cost and performance for historical metrics.

It seems like there will be more collaboration between the projects in the future, and there’s a juicy Twitter hot take on it.

M3DB

Uber wrote their own thing, too

M3DB follows a Thanos-like model, but according to a Hacker News post from its creator, Thanos didn’t fit Uber’s needs: historical metrics were queried too often to tolerate slow fetches from an object store like Amazon S3, and moving that data between the cloud and Uber’s on-prem data centers racked up massive bandwidth costs.

There’s now a startup called Chronosphere that has spun off from this work, and they’ve raised an $11 million Series A round from Greylock Partners. According to TechCrunch, it will take over management of the M3 project going forward.

VictoriaMetrics

Like M3DB, Thanos, and Cortex, the creators of VictoriaMetrics also faced a cost and performance problem. They were inspired by ideas from ClickHouse, a cool-kid database created by Yandex for which people publish extremely impressive performance and cost numbers, even when running it on three-year-old Intel hardware.

The VictoriaMetrics team, according to their FAQ, is prioritizing low operational cost and good developer/operator experience. There’s even a way to run it in a single node configuration. The performance numbers look impressive, and it seems like they also sell (or plan to sell) some kind of managed cloud service.

Coming soon to a startup near you?

At this point, you might be asking, “why bother? I’d rather pay [SaaS vendor] and not deal with any of this.”

The interesting thing is that [SaaS vendor] may very well be using one of these projects to power the observability solution they’re selling. Cortex, notably, is being used by Weave Cloud, Grafana Cloud, and a new enterprise service mesh called AspenMesh. Thanos was developed at (and is presumably still being used by) the gaming software company Improbable, and M3DB, of course, is being used at Uber.

Another exciting thing about these projects is that they offer a front-row seat to talented engineers solving hard distributed systems problems related to large-scale monitoring and making different decisions along the way. All this activity suggests a healthy open-source ecosystem that’s supporting a number of observability startups and providing solutions for other companies that need to run some kind of Big Prometheus. There are more high-quality open-source choices than ever.

Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.

Addition (since there are no comments on Substack): People in the know say you should think very, very carefully before deciding you need to run one of these solutions.

Disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

eBPF: A new BFF for Observability Startups

The newsletter this week continues looking at interesting products monitoring startups are building. After a shallow dive into observability pipelines, let’s consider eBPF: the most technical and nerdy topic in the entire monitoring space right now.

eBPF, short for extended Berkeley Packet Filter, is a Linux subsystem that has seen significant adoption in systems performance engineering and software-defined networking tools. Netflix’s Brendan Gregg, who just published a book about eBPF, has called it a “Linux superpower”.

A small number of startups with very technical founders are using eBPF for flow monitoring, a type of network traffic analysis. Network data collected using eBPF, particularly when combined with metadata from Kubernetes or service meshes, opens up some interesting use cases and hints at bigger things to come—maybe.

eBPF vs Traditional Instrumentation 

If you’re not a Linux kernel hacker, there’s one fundamental thing about eBPF worth understanding: how it collects observability data.

For the past two decades, the general approach to building an application monitoring solution has been to write a special-sauce library (“a monitoring agent”) that runs alongside your code and instruments different events to generate metrics. It’s hard to do well because of a massive number of edge cases (see here), but at a high level it looks something like this:
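
In Python land, that library usually shows up as a decorator or middleware that wraps your code and reports what it sees. A toy version (generic, not any particular vendor’s agent) looks like:

```python
import time
from collections import defaultdict

METRICS = defaultdict(list)  # stand-in for the agent's buffer/exporter

def instrument(name):
    """Toy agent-style instrumentation: wrap a function and record its latency."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                METRICS[name].append(time.monotonic() - start)
        return wrapper
    return decorator

@instrument("checkout.charge_card")
def charge_card(order):
    ...  # business logic; every call is timed by the wrapper above
```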

The eBPF approach is fundamentally different: Linux engineers write programs that run safely inside the kernel to hook into and measure nearly any kind of system event. Networking-related events, like connections opening or closing, are especially well-supported. Highly (highly!) simplified, it looks like this:

With eBPF, interesting events in your code or in the kernel—like how many connections are made to an IP address—can be captured at a lower level, without special application libraries or code changes. If well written, eBPF programs are also very performant. Facebook, apparently, now has around 40 eBPF programs running on every server.
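
To make “programs that run safely inside the kernel” concrete, here’s a classic bcc-style example (it needs the bcc toolkit, a recent kernel, and root): a tiny eBPF program attached as a kprobe that fires on every outbound IPv4 TCP connection, with no changes to any application.

```python
from bcc import BPF  # Python bindings from the BPF Compiler Collection (bcc)

# The eBPF program itself is written in C and compiled/loaded at runtime.
# Naming the function kprobe__tcp_v4_connect makes bcc attach it to that kernel function.
prog = r"""
int kprobe__tcp_v4_connect(void *ctx) {
    bpf_trace_printk("outbound TCP connect\n");
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing tcp_v4_connect()... hit Ctrl-C to stop")
b.trace_print()  # stream the kernel trace pipe to stdout
```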

As of late 2019, six to seven startups seem to be using eBPF for some kind of observability offering (and discussing it publicly): Flowmill, Sysdig, Weaveworks’ Weave Scope, ntop’s ntopng, Cilium’s Hubble, Datadog, and Instana*. There is also an open-source Prometheus exporter from Cloudflare.

The main thing all the paid products have in common is that they seem to be good at building detailed network flow maps and capturing related metrics:

Since eBPF works on any new-ish kernel version, there is also a possibility of building a more complete map of the things that do things in your cloud/datacenter, not just the latest stuff you have running in Kubernetes or services that you hand-instrumented in Rust.

Flowmill’s CEO Jonathan Perry described three broad use cases for this kind of monitoring during a recent talk:

  1. Software architecture (how everything fits together with neat visualizations)

  2. Service health (figuring out what’s not working via networking metrics)

  3. Data transfer cost optimization (see Corey Quinn’s tweet on why this is so painful in the cloud)

It’s a solid list and expands on well-known eBPF security and performance tuning use-cases.

What’s next?

All of the work done in eBPF so far opens up an interesting question: will it eventually replace most forms of agent-based instrumentation with magic bytecode that runs in the kernel, understands everything your system, applications, network, and services are doing at an extremely low level, and produces amazing visualizations and operational insights at hyper-cloud scale?

Not yet. But people are probably working on it. To nerd out on how it might happen, take a look at user-level dynamic tracing (uprobes) or user statically-defined tracing (USDT). Both are methods to hook into application-level events. There are some really interesting blog posts and examples and cool proof-of-concept code, like a BPF program that spies on everything you type into the shell.

There’s still a lot of superhero-level systems engineering that needs to be done. Until someone figures that out, eBPF solutions are exciting to watch for security use cases and network or flow monitoring—especially if you’re experimenting with service meshes or running lots of clusters.

That’s it for this newsletter. If you enjoyed reading, feel free to share using the link below or subscribe.

*Correction: The email edition of this newsletter left out Instana and Datadog. Instana seems to be using some eBPF functionality in technical preview according to their docs, and Datadog is using a fork of Weaveworks’ TCP tracer in their month-old Network monitoring solution.

Newsletter disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.

Appendix: Deep Dives on eBPF Monitoring from different startup perspectives

Observability Pipelines at Kubecon

I learned about a new type of marketecture, and have a feeling it’s not going away anytime soon. It’s called the observability pipeline.

The term appears multiple times on the marketing pages of companies that exhibited in the start-up tier at Kubecon last week. This tier is the lowest-cost sponsorship, and I thought it would be an interesting filter for learning about new (or thrifty) software companies. Out of the 74 startup exhibitors:

  • 10 sell something involving data or storage

  • 8 sell something involving security

  • 12 sell something involving Kubernetes cluster management, especially multi-cluster

  • 11 sell something that could be described as a monitoring or observability solution

Until recently, monitoring businesses could be categorized by the type of operational data their software collects: logs (Splunk) or metrics and traces (APM companies). This categorization still kind of works in late 2019:

A new category of monitoring startups, however, is companies that offer some kind of observability pipeline, an event-driven workflow for filtering and routing operational data. Think of it as intelligent software glue for teams with dozens of tools and services that deal with operational data, visualization and alerts.

The architecture of these pipelines can be found on the Cribl, Sensu, and NexClipper websites. You also see a similar concept without an ops-specific focus in Amazon’s EventBridge or Azure’s Event Grid. There’s also Stripe’s open-source Veneur. Everybody visualizes it in a similar way—data in, magic, data out:

A generic version of it could look something like this:

Implementing a monitoring pipeline opens up some interesting possibilities. A major use case is simplifying operational concerns around compliance and security while giving individual engineering or ops teams autonomy to extend the pipeline for different workflows, dashboards, analytics, and alerting without breaking things for others.

For example, imagine an ops team wants the capability to detect anomalies that could indicate “something very bad is happening” using bleeding-edge machine learning (ML). With a pipeline defined, they just have to change an existing configuration and logs for specific apps are automatically re-routed to a third-party service that specializes in ML (like Zebrium, another Kubecon exhibitor). If the data needs to be sanitized before being sent to the third party, no problem: the team just attaches a new rule that strips PII. There’s a lot of flexibility, including picking and choosing specialized tools to visualize and understand data.
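
A toy sketch of that scrub-then-route idea (no particular vendor’s pipeline, just the general shape):

```python
import re

def scrub_pii(event):
    """Pipeline rule: redact anything that looks like an email address."""
    event["message"] = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", event["message"])
    return event

def route(event):
    """Pipeline rule: decide which downstream tools receive this event."""
    destinations = ["log-archive"]                 # everything gets archived
    if event.get("app") == "checkout":
        destinations.append("ml-anomaly-service")  # hypothetical third-party ML sink
    return destinations

event = {"app": "checkout", "message": "payment failed for jane@example.com"}
print(scrub_pii(event), route(event))
```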

Cribl, founded by ex-Splunk employees, also points out on their site that there’s potential for cost savings: you can implement policies in the pipeline so you see the most important data in real-time or convert logs into cheaper metrics. It’s still early, but emerging standards (see OpenMetrics, OpenTelemetry) could make integrating more data into pipelines easier as well.
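
The “convert logs into cheaper metrics” trick is also easy to picture: instead of shipping every line, count them on the way through and forward only the counts. A toy version:

```python
from collections import Counter

def logs_to_metrics(log_lines):
    """Collapse raw log lines into per-level counts before they leave the pipeline."""
    counts = Counter()
    for line in log_lines:
        level = line.split(" ", 1)[0].lower()  # assumes lines start with a level, e.g. "ERROR ..."
        counts[level] += 1
    return counts  # ship a handful of counters instead of thousands of log lines

print(logs_to_metrics(["ERROR payment failed", "INFO request ok", "ERROR timeout"]))
# Counter({'error': 2, 'info': 1})
```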

Event-driven pipelines are not a new idea at all, but as far as I can tell it’s new to see pipelines as a distinct and highly configurable feature bundled inside a monitoring solution. If you squint, it almost looks like the beginning of control and data planes for observability.

Kubecon startups not doing pipeline stuff are equally interesting:

  • Humio, fresh off a Thoughtworks Tech Radar mention, and Chronosphere, a few weeks post-series A. Both seem to be building their products/special sauce around custom time series databases.

  • Many founders of observability startups over the past few years can describe themselves as “ex-[large technology company]”. Flowmill is what happens when the founder is a Computer Science PhD from MIT CSAIL and an expert in some pretty interesting, bleeding-edge Linux kernel technology (eBPF).

  • Rookout seems incredibly cool from a developer perspective: click on a line of your code and get live production logs or metrics. They call it non-breaking breakpoints, and the demo on their homepage is impressive.

There’s a lot of really interesting stuff going on, and I want to get into the details of eBPF and why it’s so fascinating in a monitoring context in the next newsletter.

Happy Thanksgiving this week for people based in the US. This post on using a bakery for explaining technology jobs to family comes in handy this time of year. 

Thanks for subscribing, and if you enjoyed this feel free to share with the link below.

Newsletter disclosure: Opinions my own. I am not employed, consulting, or an investor in any of the mentioned companies or their competitors.
