Chaos, Complexity and Catalogs
Startups helping operate and understand complex systems through chaos engineering or service catalogs
Welcome to all the new subscribers who like databases. This is Monitoring Monitoring, a quasi-monthly newsletter about startups in the observability space. You can subscribe here.
Monitoring is most appreciated when something breaks. Despite a lot of marketing (and snake oil) around AI, it’s still up to human operators to find a fix.
The lessons from a physician's cult-favorite resilience engineering paper from the nineties, Richard Cook's How Complex Systems Fail, still resonate with people building software today. Consider these two points:
12. Human practitioners are the adaptable element of complex systems.
18. Failure free operations require experience with failure.
The messy intersection of people, organizations, and complex (always failing) software is the focus of this newsletter. If we use monitoring, observability and dashboards to answer the question “what’s broken?”, here are two more questions:
How do we break it? (#18) and Who owns it? (#12)
Chaos as a Service, or “How do we break it?”
In 2011, Netflix set out to test its systems' ability to tolerate failures by deliberately injecting them with a tool called Chaos Monkey. The idea was extremely influential: nine years later, resilience and chaos engineering make up about half the sessions at SREcon in March, and there's now the equivalent of an Agile Manifesto for chaos (the Principles of Chaos Engineering).
Gremlin (founded 2016) and ChaosIQ (founded 2017) are selling SaaS solutions to run chaos experiments and there is a long list of open-source tools in the space. My favorite project name is Bloomberg’s PowerfulSeal.
Gremlin and ChaosIQ have several integrations to run chaos experiments in different parts of the stack (like the network, cloud provider, or application). The idea, as with Chaos Monkey, is that you deliberately create failure scenarios and learn from what happens in order to build more resilient systems.
There are a couple of different ways chaos solutions fit into monitoring (other than using logs and dashboards to figure out what broke during an experiment). Gremlin can overlay chaos events on top of Datadog dashboards, while ChaosIQ has integrations for Prometheus and OpenTracing. The OpenTracing integration gets at a cool idea: running an experiment can change the observability of the system itself. When you start the experiment, detailed traces can be collected automatically.
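To make that last idea concrete, here's a minimal sketch of what "the experiment changes the observability" could look like in practice. The chaos client and tracer config below are hypothetical stand-ins, not Gremlin's or ChaosIQ's actual APIs: the trace sampling rate is turned up to 100% for exactly the duration of an experiment, then restored.

```python
from contextlib import contextmanager

# Hypothetical stand-ins for whatever chaos tool and tracer your stack
# actually uses; these are NOT Gremlin or ChaosIQ APIs.
class TracerConfig:
    def __init__(self, sample_rate=0.01):
        self.sample_rate = sample_rate  # fraction of requests traced

class ChaosClient:
    def start_experiment(self, name):
        print(f"starting experiment: {name}")

    def stop_experiment(self, name):
        print(f"stopping experiment: {name}")

@contextmanager
def chaos_experiment(chaos, tracer_config, name):
    """Collect detailed traces only while a chaos experiment is running."""
    previous_rate = tracer_config.sample_rate
    tracer_config.sample_rate = 1.0  # trace everything while failure is injected
    chaos.start_experiment(name)
    try:
        yield
    finally:
        chaos.stop_experiment(name)
        tracer_config.sample_rate = previous_rate  # back to normal sampling

# Usage: the system is fully traced only for the duration of the experiment.
config = TracerConfig()
with chaos_experiment(ChaosClient(), config, "blackhole-payments-db"):
    pass  # exercise the system here and see what the traces reveal
```

The nice property is that expensive, detailed tracing is scoped to exactly the window where you expect something interesting to happen.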
Before you run chaos experiments, however, it’s a good idea to know what your services actually do and who owns them.
Actually Your Problem as a Service, or “Who owns it?”
There’s a very human problem related when an organization enters the late-stage microservice phase: it’s no longer clear from dashboards or internal documentation who owns a service or what a given service does—especially if the original owner was promoted (or left the company). Service naming can complicate understanding (see AWS Systems Manager Session Manager) and service metadata, in any case, are usually spread across clusters, data centers, and dashboards. There’s probably something clever to say about Conway’s Law here.
Enter OpsLevel, effx, Dive and GitCatalog. The key words on their marketing pages are "owned", "tame", "structured", and "up-to-date". Most provide integrations into different types of systems (clusters, CI/CD, alerting, messaging) so the service catalog updates automatically instead of going stale on an internal wiki page. According to Joey Parsons, founder of effx, the magic number of services at which you start to feel this pain is around 20.
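As a rough sketch of what these catalogs keep in sync (the field names below are made up for illustration, not any vendor's schema), an entry mostly boils down to ownership plus the pointers that otherwise rot on a wiki:

```python
# Hypothetical catalog entry: illustrative fields only, not the actual
# schema of OpsLevel, effx, Dive, or GitCatalog.
service_entry = {
    "name": "payments-api",
    "owner": {"team": "payments", "slack": "#payments-oncall"},
    "tier": 1,                                   # how critical the service is
    "runbook": "https://wiki.example.internal/payments-api/runbook",
    "dashboards": ["https://grafana.example.internal/d/payments-api"],
    "repo": "https://github.com/example/payments-api",
    "last_synced": "2020-04-01T00:00:00Z",       # kept fresh by integrations, not humans
}

print(f"{service_entry['name']} is owned by {service_entry['owner']['team']}")
```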
Detailed metrics and logs might be available for all your services, but these startups are a reminder that none of that matters much if you can't find the right person or documentation to interpret the data.
More systems, more human problems
The work these startups are doing is a humbling reminder that, however interesting the emerging technology in the observability space, there's still a massive amount of work to be done connecting the right people to systems and helping them deal with (and anticipate) failure scenarios.
The intersection of academics, developers, SREs and operators tackling these problems is one of the most interesting interdisciplinary groups thinking about what happens when humans meet software complexity. There is a very long list of hard problems related to understanding and operating the systems we build: “who owns it?” and “how do we break it?” are just two that involve a fast-growing number of companies, consultancies, and engineers.
Thanks for reading. If you enjoyed this newsletter, feel free to share using the link below or subscribe here.
Disclosure: Opinions my own. I am not employed by, consulting for, or an investor in any of the mentioned companies or their competitors.